Machine Learning
================

This page contains a detailed analysis of the state-of-the-art (SoA) in Machine Learning (ML), which is one of the pillars of the Beyond Vision developments. It will focus mainly on DL, but it is important to understand the different fields inside AI and to organize them in such a way that it is possible to find parallelism between subsets and extract advantages from their differences :cite:p:`Domingos2012`.

.. figure:: ../_static/images/learning/AI_ML_DL_cropped.webp
   :alt: AI_ML_DL_cropped

   Artificial Intelligence (AI)

AI is a field of computer science that aims to make computers achieve human-style intelligence. As represented in the figure, ML is a subset of AI, which in turn contains a subset that tries to replicate the human brain, called NN. NN contains large neural models, which finally bring us to the field of DL.

ML is a set of related techniques in which computers are trained to perform a particular task rather than being explicitly programmed for it. ML algorithms can be used to infer relationships and extract knowledge from gathered data.

A NN is a construction type in ML inspired by the network of neurons (nerve cells) in the biological brain. NNs are a fundamental part of DL and will be covered in this page.

Finally there is DL, which is a subfield of ML that uses multi-layered neural networks. Often, ML and DL are used interchangeably.

Initially, we will go through the three main ways of learning, and then try to group the ML techniques into five clusters :cite:p:`Domingos2015`. The three main approaches for learning algorithms are SL, UL and RL :cite:p:`Ayodele2010`.

- SL consists in predicting outcome variables (or dependent variables) from a given set of predictor variables (data features). Using this set of variables, a function that maps inputs to desired outputs is generated in what is called the training process. The process finishes when the model achieves a desired level of accuracy on the training data. Examples of SL algorithms are: KNN, Random Forest, Decision Tree and Logistic Regression. A sample representation of a SL workflow is illustrated in the figure, where the dataset is divided by colors. After training, the algorithm correctly classifies each object by its characteristic color.

  .. figure:: ../_static/images/learning/SupervisedLearning.*
     :alt: SupervisedLearning

     Supervised Learning.

- UL is a data-driven knowledge discovery approach that can automatically infer a function that describes the structure of the analyzed data, or can highlight correlations in the data, forming different clusters of related data. A UL workflow is depicted in the figure. Examples of algorithms include: K-Means, DBSCAN and Apriori. In the figure, no information is given to the algorithm, and it has to discover that the input contains objects of different shapes and colors. Afterwards, it groups the different objects according to their similarities. At the end, the output should be 3 clusters, one per color. Notice that in the case of SL the algorithm was able to recognize that a given object belongs to a specific color class, whereas in UL it only has the notion that the objects belong to different categories.

  .. figure:: ../_static/images/learning/UnsupervisedLearning.webp
     :alt: UnsupervisedLearning

     Unsupervised Learning.
- Reinforcement Learning algorithms are trained to make specific decisions. The goal is to discover which actions lead to an optimal policy. This is done by learning from past experiences, as represented in the figure. As an example, a target policy is set, for instance the delay of a set of flows in an SDN. An algorithm then takes actions on the SDN controller that change the configuration, and for each action a reward is received, which increases as the in-place policy gets closer to the target policy. Ultimately, the algorithm will learn the set of configuration updates (actions) that result in such a target policy (e.g. Markov Decision Process). Compared to SL and UL, RL is slightly different, in the sense that it does not intend to map the input to the output. For example, it can try to take actions towards the region that contains the maximum number of red objects, but it does so indefinitely, until it reaches an end condition.

  .. figure:: ../_static/images/learning/ReinforcedLearning.webp
     :alt: Reinforcement Learning

     Reinforcement Learning.

As described in :cite:p:`Domingos2012`, there are 12 important key points that should be kept in mind when working with ML:

#. Learning = Representation + Evaluation + Optimization.

   - Representation: a classifier must be represented in a formal language that the computer can handle. Creating a set of classifiers the learner can learn is crucial.
   - Evaluation: an evaluation function is needed to distinguish good classifiers from bad ones.
   - Optimization: a method to search among the classifiers in the language for the highest-scoring one. The choice of optimization technique is key to the efficiency of the algorithm.

#. It is generalization that matters.
#. Data alone is not enough.
#. Overfitting has many faces.
#. Intuition fails in high dimensions.
#. Theoretical guarantees are not what they seem.
#. Feature engineering is the key.
#. More data beats a cleverer algorithm.
#. Learn many models, not just one.
#. Simplicity does not imply accuracy.
#. Representable does not imply learnable.
#. Correlation does not imply causation.

Machine Learning Fields
-----------------------

ML has many subfields, branches, and special techniques. To oversimplify: in SL you know what you want to teach the computer, while UL is about letting the computer figure out what can be learned. SL is the most common type of ML and the one most used at Beyond Vision.

The majority of ML algorithms can be grouped into five clusters :cite:p:`Domingos2015`, as summarized in table `1.1 <#table:typesofml>`__:

.. container::
   :name: table:typesofml

   .. table:: Different types of Machine Learning.
      +----------------+----------------------+-----------------------+-------------------------+
      | Cluster        | Origins              | Strength              | Main Algorithm          |
      +================+======================+=======================+=========================+
      | Symbolist      | Logic & philosophy   | Structure Inference   | Inverse deduction       |
      +----------------+----------------------+-----------------------+-------------------------+
      | Connectionists | Neuroscience         | Estimating Parameters | Neural Networks         |
      +----------------+----------------------+-----------------------+-------------------------+
      | Evolutionaries | Evolutionary biology | Weighing Evidence     | Genetic programming     |
      +----------------+----------------------+-----------------------+-------------------------+
      | Bayesians      | Statistics           | Structure Learning    | Probabilistic Inference |
      +----------------+----------------------+-----------------------+-------------------------+
      | Analogizers    | Psychology           | Mapping to Novelty    | Kernel Machines         |
      +----------------+----------------------+-----------------------+-------------------------+

The symbolist cluster represents algorithms built on the idea of discovering new knowledge by filling in the gaps in the knowledge that you already have. Of the five clusters, it is the one most closely related to computer science. Their master algorithm is inverse deduction. For the symbolists, learning is the inverse of deduction, which means that learning is the induction of knowledge. In practical terms, they try to create general rules from specific facts. Figure `1.5 <#fig:sg>`__ is a simplistic representation of a typical symbolist algorithm, a decision tree. In this example, a character classifier is presented, where the output is one of 4 possible groups. A decision tree has multiple types of nodes :cite:p:`Kaminski2018`. In figure `1.5 <#fig:sg>`__, decision nodes are represented in green and end nodes are represented in blue. The green arrows represent a true evaluation at the node, while the red arrows represent a false evaluation.

A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (the decision taken after computing all attributes). The paths from root to leaf represent classification rules.

.. figure:: ../_static/images/learning/SymbolistsGraphv3.webp
   :alt: Symbolist Representation

   Symbolist Representation

In decision analysis, a decision tree and the closely related influence diagram are used as a visual and analytical decision support tool, where the expected values (or expected utility) of competing alternatives are calculated. Among decision support tools, decision trees (and influence diagrams) have several advantages, such as:

- They are simple to understand and interpret. People are able to understand decision tree models after a brief explanation.
- They have value even with small datasets. Important insights can be generated based on experts describing a situation (its alternatives, probabilities, or costs) and their preferences for outcomes.
- They help determine worst, best and expected values for different scenarios.
- They use a white box model: if a given result is provided by the model, the explanation for that result can be reproduced by simple logic.
- They can be combined with other decision techniques.

On the other hand, decision trees have some disadvantages:

- They are unstable, meaning that a small change in the data can lead to a large change in the structure of the optimal decision tree.
- They are often relatively inaccurate. Many other predictors perform better with similar data. This can be remedied by replacing a single decision tree with a random forest of decision trees, but a random forest is not as easy to interpret as a single decision tree.
- For data that include categorical variables with different numbers of levels, information gain in decision trees is biased in favor of those attributes with more levels :cite:p:`Deng2011`.
- Calculations can get very complex, particularly if many values are uncertain and/or if many outcomes are linked.
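As a small illustration of the symbolist idea described above, a decision-tree classifier can be sketched with scikit-learn. The features and class labels below are hypothetical toy values that only mimic the character-classifier example in the figure; they are not the data used there.

.. code:: python

   # Minimal decision-tree sketch (assumes scikit-learn is installed).
   from sklearn.tree import DecisionTreeClassifier, export_text

   # Each sample: [has_cape, wears_glasses, is_tall], encoded as 0/1 (toy values).
   X = [[1, 0, 1], [1, 1, 1], [0, 1, 0], [0, 0, 0]]
   y = ["hero", "mentor", "sidekick", "civilian"]

   tree = DecisionTreeClassifier(max_depth=3, random_state=0)
   tree.fit(X, y)

   # The learned rules can be printed as the same kind of if/else structure
   # shown in the symbolist figure (general rules induced from specific facts).
   print(export_text(tree, feature_names=["has_cape", "wears_glasses", "is_tall"]))
   print(tree.predict([[1, 0, 0]]))  # classify a new, unseen character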
The evolutionaries have their origins in evolutionary biology. The main algorithm of this school is genetic programming, which consists in replicating the process of genetic evolution. As illustrated in figure `1.6 <#fig:evolutionaries>`__, it starts from a population of unfit (usually random) elements, which are iteratively made fit for a particular task by applying operations analogous to natural genetic processes to the population. It is essentially a heuristic search technique that searches for an optimal, or at least suitable, element. The typical operations of a genetic algorithm are listed below; a minimal code sketch of the full loop follows the advantage and drawback lists.

#. **Selection**: the fittest elements for reproduction (crossover) and mutation are selected according to a predefined fitness measure, usually proficiency at the desired task.
#. **Crossover**: involves swapping random parts of selected pairs (parents) to produce new and different offspring that become part of the new generation of elements.
#. **Mutation**: involves substituting some random part of an element with some other random part of another element.

.. figure:: ../_static/images/learning/EvolutionariesGraphv2.webp
   :alt: Genetic Algorithm Representation

   Genetic Algorithm Representation.

Some combinations, usually the best ones, are directly copied from the current generation to the new generation, which is usually called elitism. The selection and other operations are then recursively applied to the new generation of elements. Typically, members of each new generation are on average more fit than the members of the previous generation, and the best-of-generation element is often better than the best-of-generation elements from previous generations. The recursion terminates when some individual element reaches a predefined proficiency or fitness level. A branch of genetic algorithms is considered to be evolutionary bio-inspired, such as GBCA, FSA, CSO, WOA, AAA, ESA, CSOA, MFO and GWO :cite:p:`Darwish2018`.

The main advantages of genetic algorithms can be considered to be:

- They can find fit solutions in less time (fit solutions are solutions which are good according to the defined heuristic).
- The random mutation guarantees, to some extent, that a wide range of solutions is generated.
- Coding them is really easy compared to other algorithms.

On the other hand, the drawbacks of genetic algorithms are:

- It is really hard to come up with a good heuristic which actually reflects what the algorithm should do.
- They might not find the most optimal solution to the defined problem in all cases.
- It is also hard to choose parameters such as the number of generations, the population size or the stopping condition. Even when the heuristic is right, it might be hard to realize it while the algorithm has only been running for a few generations.
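Below is a minimal sketch of the selection, crossover and mutation loop described above, assuming a toy fitness function (maximising the number of ones in a bit string). The population size, mutation rate and other parameters are illustrative choices, not values taken from the text.

.. code:: python

   import random

   # Toy fitness measure: number of ones in the bit string (illustrative choice).
   def fitness(element):
       return sum(element)

   def evolve(pop_size=20, length=16, generations=30, mutation_rate=0.05):
       population = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
       for _ in range(generations):
           # Selection: keep the fittest half as parents (a simple form of elitism).
           population.sort(key=fitness, reverse=True)
           parents = population[: pop_size // 2]
           # Crossover: swap random parts of selected pairs to build offspring.
           offspring = []
           while len(parents) + len(offspring) < pop_size:
               a, b = random.sample(parents, 2)
               cut = random.randint(1, length - 1)
               offspring.append(a[:cut] + b[cut:])
           # Mutation: substitute random parts of some elements.
           for child in offspring:
               for i in range(length):
                   if random.random() < mutation_rate:
                       child[i] = 1 - child[i]
           population = parents + offspring
       return max(population, key=fitness)

   print(evolve())  # best element found after the final generation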
The Bayesians come from statistics, and most of their algorithms are extensions and reformulations of equation `[eq:bayes] <#eq:bayes>`__. In probability theory and statistics, Bayes' theorem describes the probability of an event based on a priori knowledge that may be related to the event. The theorem shows how to update a priori probabilities in view of new evidence to obtain a posteriori probabilities. The cornerstone is equation `[eq:bayes] <#eq:bayes>`__ :cite:p:`Kemp1994`, in which :math:`A` and :math:`B` are events, :math:`P(A|B)` is the conditional probability of event :math:`A` occurring given that :math:`B` is true, :math:`P(B|A)` is the conditional probability of event :math:`B` occurring given that :math:`A` is true, and finally :math:`P(A)` and :math:`P(B)` are the probabilities of observing :math:`A` and :math:`B` independently of each other, known as the marginal probabilities.

.. math::

   \label{eq:bayes}
   \left\{\begin{matrix} P(A|B) = P(A) \frac{P(B|A)}{P(B)} \\ P(B) \neq 0 \end{matrix}\right. ,

Some advantages of using Bayesian analysis include the following:

- It provides a natural and principled way of combining prior information with data, within a solid decision-theoretical framework. You can incorporate past information about a parameter and form a prior distribution for future analysis. When new observations become available, the previous posterior distribution can be used as a prior. All inferences logically follow from Bayes' theorem.
- It provides inferences that are conditional on the data and are exact, without reliance on asymptotic approximation. Small-sample inference proceeds in the same manner as with a large dataset. Bayesian analysis can also estimate any function of the parameters directly, without using the "plug-in" method (a way to estimate functionals by plugging the estimated parameters into the functionals).
- It obeys the likelihood principle. If two distinct sampling designs yield proportional likelihood functions for a parameter, then all inferences about that parameter should be identical under these two designs. Classical inference does not in general obey the likelihood principle.
- It provides a convenient setting for a wide range of models, such as hierarchical models and missing data problems. MCMC, along with other numerical methods, makes computations tractable for virtually all parametric models.

There are also disadvantages to using Bayesian analysis:

- It does not tell you how to select a prior. There is no correct way to choose a prior. Bayesian inference requires skill to translate subjective prior beliefs into a mathematically formulated prior. If you do not proceed with caution, you can generate misleading results.
- It can produce posterior distributions that are heavily influenced by the priors. From a practical point of view, it might sometimes be difficult to convince subject matter experts who do not agree with the validity of the chosen prior.
- It often comes with a high computational cost, especially in models with a large number of parameters. In addition, simulations provide slightly different answers unless the same random seed is used. Note that slight variations in simulation results do not contradict the earlier claim that Bayesian inferences are exact. The posterior distribution of a parameter is exact, given the likelihood function and the priors, while simulation-based estimates of posterior quantities can vary due to the random number generator used in the procedures.
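As a short worked instance of equation `[eq:bayes] <#eq:bayes>`__, the snippet below computes a posterior probability for a hypothetical sensing scenario; the prior and likelihood values are made up purely for illustration.

.. code:: python

   # Worked Bayes' rule example with made-up numbers:
   # A = "object is red", B = "sensor reports red".
   p_a = 0.30              # prior P(A): 30% of the objects are red
   p_b_given_a = 0.90      # likelihood P(B|A): sensor fires on 90% of red objects
   p_b_given_not_a = 0.10  # false-positive rate P(B|not A)

   # Marginal probability P(B), by total probability.
   p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

   # Posterior P(A|B) = P(A) * P(B|A) / P(B).
   p_a_given_b = p_a * p_b_given_a / p_b
   print(round(p_a_given_b, 3))  # ~0.794: the new evidence raised the prior from 0.30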
The analogizers have influences from a lot of different fields, with psychology probably the most important of them. The core algorithm of the analogizers is kernel machines, also known as SVM :cite:p:`Cortes1995`, as exemplified in figure `1.7 <#fig:svmgraph>`__.

.. figure:: ../_static/images/learning/SVMGraphV2.webp
   :alt: Support Vector Machine Representation

   Support Vector Machine Representation.

Given a set of training examples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier (although methods such as Platt scaling exist to use SVM in a probabilistic classification setting). An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on the side of the gap on which they fall. In addition to performing linear classification, SVMs can efficiently perform non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces. When data is unlabeled, SL is not possible and an UL approach is required, which attempts to find a natural clustering of the data into groups and then map new data to these formed groups :cite:p:`Statnikov2011`.

In general terms, the main advantages of SVMs can be described as:

- SVMs are very good when there is not much information about the working data.
- They work well even with unstructured and semi-structured data like text, images and trees.
- The kernel trick is the major advantage of SVMs. With an appropriate kernel function, it is possible to solve many complex problems with few parameters.
- Unlike neural networks, SVMs do not get stuck in local optima.
- They scale relatively well to high-dimensional data.
- SVM models generalize well in practice; the risk of over-fitting is lower in SVMs.

The main SVM disadvantages are :cite:p:`Cawley2010`:

- Choosing a "good" kernel function is not easy.
- Long training times for large datasets.
- It is difficult to understand and interpret the final model, the variable weights and their individual impact.
- Since the final model is not easy to inspect, small calibrations cannot be made to the model, hence it is tough to incorporate business logic.
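A minimal scikit-learn sketch of the kernel-machine idea described above follows; the RBF kernel and the toy two-class points are illustrative choices, not values from the figure.

.. code:: python

   # Minimal SVM sketch (assumes scikit-learn); toy 2-D points, two classes.
   from sklearn.svm import SVC

   X = [[0.0, 0.0], [0.2, 0.1], [0.9, 1.0], [1.0, 0.8]]
   y = [0, 0, 1, 1]

   # Kernel trick: an RBF kernel implicitly maps the inputs into a
   # high-dimensional feature space where a maximum-margin separator is fitted.
   clf = SVC(kernel="rbf", C=1.0, gamma="scale")
   clf.fit(X, y)

   print(clf.predict([[0.1, 0.2], [0.8, 0.9]]))  # expected: [0 1]
   print(clf.support_vectors_)  # the examples that define the margin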
Finally, the last group of the ML clusters is the connectionists, who have their origins in neuroscience, because they try to take inspiration from how the brain works. This is the cluster that will be further analyzed, with additional details given in the next section. Figure `1.8 <#fig:nn>`__ shows a brief representation of the connectionists' algorithm, a neural network.

.. figure:: ../_static/images/learning/NeuralNetwork.webp
   :alt: Connectionists Representation

   Connectionists Representation.

In the search for knowledge in SL (and in particular in DL), ways of interconnecting different types of DL architectures are explored in order to solve yet unsolved problems, such as video context awareness classifiers in collision detectors, data fusion and correlation, or even clinical future estimation. Computer vision has become ubiquitous in our society, with applications in search, image understanding, apps, mapping, medicine, drones, and self-driving cars. Core to many of these applications are visual recognition tasks such as image classification, localization and detection.

Recent developments in NN approaches have greatly advanced the performance of these SoA visual recognition systems. The next subsection is a deep dive into the details of DL architectures, with a focus on learning end-to-end models and datasets for these tasks, particularly image classification. This part of the documentation gives a detailed summary of neural networks and of the cutting-edge research in computer vision that is later reused and applied to create our collision avoidance algorithm.

.. _sec:deep_learning:

Deep Learning
-------------

ANNs were inspired by information processing and distributed communication nodes in biological systems. ANNs have several differences from biological brains. Specifically, neural networks tend to be static and symbolic, while the biological brain of most living organisms is dynamic (plastic) and analog :cite:p:`Marblestone2016, Olshausen1996, Scellier2016`.

The term DL was introduced to the ML community by Rina Dechter in 1986 :cite:p:`Dechter1986, Schmidhuber2015` and to artificial NNs by Igor Aizenberg et al. in 2000, in the context of Boolean threshold neurons :cite:p:`Aizenberg2001, Gomez2005`. DL is part of a broader family of ML methods based on artificial NNs :cite:p:`Schmidhuber2015, Lecun2015`. The first general, working learning algorithm for supervised, deep, feedforward, multilayer perceptrons was published by Alexey Ivakhnenko and Lapa in 1965 :cite:p:`Ivakhnenko1965`. A 1971 paper described a deep network with 8 layers trained by the group method of data handling algorithm :cite:p:`Ivakhnenko1971`. The work on DL in computer vision then hibernated for a while, until in 1989 Yann LeCun et al. applied the standard backpropagation algorithm, which had been around as the reverse mode of automatic differentiation since 1970. The impact of DL in industry began in the early 2000s, when CNNs already processed an estimated 10% to 20% of all the checks written in the US :cite:p:`Lecun2015`.

DL architectures such as deep neural networks, deep belief networks, recurrent neural networks and convolutional neural networks have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs, where they have produced results comparable to, and in some cases superior to, human experts :cite:p:`Schmidhuber2015, Mousavi2018, Arel2010, Guo2016, Ahmad2019`. Modern CNNs are considered one of the best techniques for learning image and video content, showing SoA results on image recognition, segmentation, detection, and retrieval related tasks :cite:p:`Liu2019,Ciresan2012`. The success of CNNs has captured attention beyond academia. In industry, companies such as Google, Microsoft, AT&T, Facebook and PDM have developed active research groups for exploring new CNN architectures :cite:p:`Deng2013`.

In DL, each level learns to transform its input data into a slightly more abstract and composite representation. In an image recognition application, the raw input is a matrix of pixels; the first representational layer may abstract the pixels and encode edges, the second layer may compose and encode arrangements of edges, the third layer may encode a nose and eyes, and the fourth layer may recognize that the image contains a face.
Moreover, a DL process can learn which features to optimally place at which level on its own :cite:p:`Bengio2013,Lecun2015b`. The *deep* in *deep learning* refers to the number of layers through which the data is transformed. More precisely, DL systems have a substantial CAP depth. The CAP is the chain of transformations from input to output. CAPs describe potentially causal connections between input and output. For a feedforward neural network, the depth of the CAPs is that of the network and is the number of hidden layers plus one (as the output layer is also parameterized). For recurrent neural networks, in which a signal may propagate through a layer more than once, the depth is potentially unlimited :cite:p:`Schmidhuber2015`. No universally agreed upon threshold of depth divides shallow learning from DL, but most researchers agree that DL involves :math:`depth > 2`. A CAP of depth 2 has been shown to be a universal approximator, in the sense that it can emulate any function :cite:p:`Hinton2006`. Beyond that, more layers do not add to the function approximation ability of the network. Deep models are, however, able to extract better features than shallow models, and hence extra layers help in learning features. DL architectures are often constructed with more layers than necessary, which helps to disentangle these abstractions and pick out the features that improve performance.

A CNN topology is divided into multiple learning stages composed of a combination of convolutional layers, non-linear processing units, and subsampling layers :cite:p:`Jarrett2009`. As shown in figure `1.9 <#fig:cnnarquitecture>`__, the architecture of a typical CNN model is structured as a series of layers. Each layer performs multiple transformations using a bank of convolutional kernels (filters) :cite:p:`LeCun2010`. All the components involved in such an architecture will be described later, in section `1.3 <#sec:cnn_blocks>`__. The convolution operation extracts locally correlated features by dividing the image into small slices (similar to the retina of the human eye), making it capable of learning suitable features. The output of the convolutional kernels is assigned to non-linear processing units, which not only helps in learning abstractions but also embeds non-linearity in the feature space. The non-linearity outputs different patterns of activations for different responses, which facilitates learning the semantics of different images. This is usually followed by subsampling, which helps in compressing the results and also makes the input invariant to geometrical distortions :cite:p:`LeCun2010,Scherer2010`.

.. figure:: ../_static/images/learning/CNNPresentation.webp
   :alt: The architecture of a standard Convolutional Neural Network model

   The architecture of a standard Convolutional Neural Network model.
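A minimal sketch of such a stack of convolution, non-linearity and subsampling stages, written with the Keras Sequential API; the layer sizes and the 28x28 grayscale input shape are arbitrary illustrative choices, not the architecture shown in the figure.

.. code:: python

   # Minimal CNN sketch (assumes TensorFlow/Keras); sizes are illustrative only.
   from tensorflow.keras import layers, models

   model = models.Sequential([
       # Convolution: a bank of kernels extracting locally correlated features.
       layers.Conv2D(16, (3, 3), activation="relu", input_shape=(28, 28, 1)),
       # Subsampling: compresses the feature maps.
       layers.MaxPooling2D((2, 2)),
       layers.Conv2D(32, (3, 3), activation="relu"),
       layers.MaxPooling2D((2, 2)),
       # Fully connected head used for the final classification.
       layers.Flatten(),
       layers.Dense(64, activation="relu"),
       layers.Dense(10, activation="softmax"),
   ])
   model.compile(optimizer="adam",
                 loss="sparse_categorical_crossentropy",
                 metrics=["accuracy"])
   model.summary()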
The work conducted by Hubel and Wiesel :cite:p:`Hubel1962,Hubel1968` inspired the initial architectural designs of CNNs, following the basic structure of the primate visual cortex. As illustrated in figure `1.10 <#fig:cnnevolution>`__, the first steps can be considered to be in 1980, with the initial work on Neocognitron-like networks :cite:p:`Fukushima1980`. Using this knowledge, Yann LeCun :cite:p:`LeCun1989` applied such networks to grid-like topological data, which displayed the hierarchical feature extraction ability of CNNs.

.. figure:: ../_static/images/learning/CNNTimelineV2.webp
   :alt: Convolutional Neural Networks evolution over the years

   Convolutional Neural Networks evolution over the years.

This hierarchical organization emulates the deep and layered learning process of the neocortex in the human brain, which extracts features from the underlying world :cite:p:`Bengio2009`. The engineered process in CNNs resembles the V1-V2-V4-IT/VTC ventral pathway of the primate visual cortex :cite:p:`laskar2018correspondence`. The retinotopic area provides input to the primate visual cortex, where contrast normalization and multi-scale highpass filtering are performed by the lateral geniculate nucleus. Afterwards, different regions of the visual cortex, categorized as V1, V2, V3, and V4, classify and detect information. The V1 and V2 areas of the visual cortex can be imagined as the convolutional and subsampling layers, whereas the inferior temporal region is similar to the final layers of a CNN, which make inferences about the image :cite:p:`GrillSpector2018`. CNN training is similar to that of a standard NN, where the weights are adjusted with the backpropagation algorithm, iterating over multiple input images. In backpropagation, the objective is to minimize a cost function, similarly to the response-based learning of the human brain :cite:p:`Najafabadi2015`.

The revolution in the use of CNNs for image understanding and segmentation occurred when it was discovered that the results could be improved by tweaking the layers' depth :cite:p:`Krizhevsky2017`. Deep CNN architectures have an advantage over shallow architectures when dealing with complex learning problems. Using multiple linear and non-linear neurons in a layer-wise fashion gives these deep networks the ability to learn representations at different levels of abstraction.

Additionally, advances in hardware enabled the renewed interest. In 2009, Nvidia was involved in what was called the big bang of deep learning, as DNNs were trained with Nvidia GPUs :cite:p:`Dixon2016`. That year, Google Brain used Nvidia GPUs to create capable DNNs. While there, Andrew Ng determined that GPUs could increase the speed of DL systems by about 100 times :cite:p:`TheEconomist2010`. In particular, GPUs are well-suited for the matrix/vector math involved in ML :cite:p:`Oh2004, Darmatasia2017`. GPUs speed up training algorithms by orders of magnitude, reducing running times from weeks to days :cite:p:`Ciresan2010, Raina2009`. Specialized hardware and algorithm optimizations can be used for efficient processing :cite:p:`Sze2017`.

Significant additional impacts in image or object recognition were noticed from 2011 to 2012. Although CNNs trained by backpropagation had been around for decades, and GPU implementations of NNs (including CNNs) for years, fast implementations of CNNs with max-pooling on GPUs in the style of Ciresan and colleagues were needed to progress on computer vision :cite:p:`Oh2004, LeCun2008, Ciresan2011`. Image classification was then extended to the more challenging task of generating descriptions (captions) for images, often as a combination of CNNs and LSTMs :cite:p:`Vinyals2015, Fang2015, Zhong2011`. Some researchers assess that the October 2012 ImageNet victory anchored the start of a deep learning revolution that has transformed the AI industry :cite:p:`Metz2016`.

Multiple improvements in CNN learning strategies and architectures have been presented to make CNNs scalable to large and complex problems. These innovations can be divided into regularization, structural reformulation, parameter optimization and computational efficiency. Major innovations in CNNs have been proposed since 2012 and were mainly due to the restructuring of processing units and the design of new blocks.
Zeiler and Fergus :cite:p:`Zeiler2014` presented the concept of layer-wise visualization of features, which shifted the trend towards feature extraction at low spatial resolution in deep architectures such as VGG :cite:p:`Simonyan2015`. Currently, most of the new architectures are built upon the principle of simple and homogeneous topology, as presented by VGG. However, the Google group introduced an interesting idea of split, transform, and merge, which is known as an *inception block*. The inception block introduced the concept of branching within a layer, which allows feature abstraction at different spatial scales :cite:p:`Szegedy2015`. In 2015, the concept of *skip connections* was introduced by ResNet :cite:p:`He2016`. Afterwards, this concept was used by most of the succeeding NNs, such as Inception-ResNet, WideResNet and ResNeXt :cite:p:`Xie2017,Szegedy2017,Zagoruyko2016`. Towards improving the learning capacity of CNNs, different designs such as WideResNet, Pyramidal Net and Xception have been proposed, exploring the effect of additional cardinality transformations and increases in width :cite:p:`Xie2017,Zagoruyko2016,Han2017`. The focus of research moved from parameter and connection optimization towards improved architectural design (layer structure) of the network. This change resulted in many new architectural blocks such as channel boosting, spatial and channel-wise exploitation and attention-based information processing :cite:p:`khan2018new,Woo2018,Wang2017`.

Overfitting problems are raised by the added layers of abstraction, which allow the network to model rare dependencies in the training data. Regularization methods such as Ivakhnenko's unit pruning, weight decay or sparsity can be applied during training to combat overfitting :cite:p:`Bengio2013`. Alternatively, the dropout regularization technique randomly omits neurons from the hidden layers during training, which helps to exclude rare dependencies :cite:p:`Dahl2013`. Finally, data can be augmented via methods such as cropping and rotating, so that smaller training sets can be increased in size to reduce the chances of overfitting, which will be detailed in `1.4 <#sec:dldataaug>`__.

The learning computation time comes from the many training parameters of standard DNNs, such as the size (number of layers and number of neurons per layer), the learning rate, and the initial weights. Sweeping through the parameter space for optimal parameters may not be feasible due to the cost in time and computational resources. Various tricks, such as batching (computing the gradient on several training examples at once rather than on individual examples) :cite:p:`Hinton2012`, speed up computation. The large processing capabilities of many-core architectures (such as GPUs or specialized CPUs such as the Intel Xeon Phi) have produced significant speedups in training, because of the suitability of such processing architectures for matrix and vector computations :cite:p:`Viebke2019,You2017`.

In recent years, many different surveys have been conducted on CNNs, depicting and comparing their basic components. The survey reported by Gu :cite:p:`Gu2018` reviewed the famous models from 2012-2015 along with their core blocks. There are also other similar surveys in the literature that discuss different CNN algorithms and focus on applications for demonstration of results :cite:p:`Najafabadi2015,Guo2016,Liu2017,Srinivas2016,LeCun2010`.
The following subsections of `1.3 <#sec:cnn_blocks>`__ try to aggregate this information into a concise yet broad explanation of the field, detailing building blocks, datasets and models.

.. _sec:cnn_blocks:

Basic CNNs Building Blocks
--------------------------

For most perception applications, the CNN is considered the most widely used ML technique. A typical block diagram of an ML system was shown in figure `1.9 <#fig:cnnarquitecture>`__. Since SoA CNNs possess both good feature extraction and strong discrimination ability, the most common tasks are feature extraction and classification. The most common CNN architecture is composed of alternating layers of convolution and pooling followed by one or two fully connected layers at the end. In some cases, the fully connected layers are swapped with a global average pooling layer. In addition to the various learning stages, different regulatory units, such as batch normalization and dropout, are also incorporated to optimize CNN performance :cite:p:`Bouvrie2006`. The arrangement of these components plays a fundamental role in new architecture designs and thus in achieving enhanced performance. This subsection briefly describes and discusses the role of these components in a CNN architecture.

.. _sec:cnn_blocks_conv_layer:

Convolutional Layer
~~~~~~~~~~~~~~~~~~~

A convolutional layer (sometimes denominated conv layer) is composed of a set of convolutional kernels (where each neuron acts as a kernel). These kernels are linked with a small area of the image known as a *receptive field*. The image is divided into small blocks (receptive fields) and convolved with a specific set of weights (multiplying the elements of the filter with the corresponding receptive field elements) :cite:p:`Bouvrie2006`. This operation is similar to a convolution, but the two are mathematically different. The convolutional layer operation can be expressed as follows:

.. math:: C_{l}^{k} = P_{x,y}^{k} * K_{l}^{k}
   \label{eq:conv_layer}

In equation `[eq:conv_layer] <#eq:conv_layer>`__, the input pixel of the image is represented by :math:`P_{x,y}`, where :math:`x`, :math:`y` indicate the spatial locality, and :math:`K_{l}^{k}` represents the :math:`l^{th}` convolutional kernel of the :math:`k^{th}` layer. Dividing the image into small blocks helps in extracting local pixel correlations. Different sets of features within the image are extracted by sliding the convolutional kernel over the whole image with the same set of weights. This weight sharing across the kernels of the convolution operation makes CNN parameters efficient when compared to a fully connected NN. The convolution operation may further be categorized into different types based on the type and size of the filters, the type of padding, and the direction of convolution :cite:p:`Lecun2015`. If the kernel is symmetric, the convolution operation becomes a correlation operation :cite:p:`IanGoodfellowYoshuaBengio2015`.

.. figure:: ../_static/images/learning/ConvOperation.webp
   :alt: Convolutional layer destination feature value calculation example

   Convolutional layer destination feature value calculation example.

Figure `1.11 <#fig:cnnevolutionexample>`__ represents an example of the kernel sliding over a source pixel. Initially, the center element of the kernel is placed over the source pixel. Afterwards, the destination pixel :math:`C_{l}^{k}` is calculated as the weighted sum of the source pixel and its neighbors. In this example, the resulting destination feature value can be calculated as:

.. math:: C_{l}^{k} = 0*3 + 0*2 + 0*1 + 0*4 + 3*1 + 1*2 + 0*2 + 2*3 + 1*5 = 16
   \label{eq:conv_layer_exemple}
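A small numpy sketch of this sliding-window weighted sum follows. The 3x3 kernel and the top-left image patch reproduce the worked values of equation `[eq:conv_layer_exemple] <#eq:conv_layer_exemple>`__ (so the first output value is 16); the remaining input values are made up to pad the example.

.. code:: python

   import numpy as np

   def conv2d_valid(image, kernel):
       # Slide the kernel over the image and take the weighted sum at each
       # position. Strictly this is a cross-correlation (no kernel flip),
       # which is what CNN layers actually compute.
       kh, kw = kernel.shape
       oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
       out = np.zeros((oh, ow))
       for i in range(oh):
           for j in range(ow):
               receptive_field = image[i:i + kh, j:j + kw]
               out[i, j] = np.sum(receptive_field * kernel)
       return out

   image = np.array([[0, 0, 0, 1],
                     [0, 3, 1, 2],
                     [0, 2, 1, 0],
                     [1, 0, 4, 2]])
   kernel = np.array([[3, 2, 1],
                      [4, 1, 2],
                      [2, 3, 5]])
   print(conv2d_valid(image, kernel))  # top-left value is 16, as in the equation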
.. _sec:cnn_blocks_pooling_layer:

Pooling Layer
~~~~~~~~~~~~~

The convolution operation outputs feature maps. Once a feature value is calculated, its exact location becomes less important, as long as its approximate position relative to the others is preserved. Pooling, or downsampling, is, like convolution, a local operation. It sums up similar information in the neighborhood of the receptive field and outputs the dominant response within this local region :cite:p:`Lee2016,Lee2018`. Equation `[eq:pooling_layer] <#eq:pooling_layer>`__ represents the pooling operation, in which :math:`Z_{l}` represents the :math:`l^{th}` output feature map, :math:`C_{x,y}^{l}` represents the :math:`l^{th}` input feature map, and :math:`f_{p}(x)` defines the type of pooling operation.

.. math:: Z_{l} = f_{p} ( C_{x,y}^{l} )
   \label{eq:pooling_layer}

The pooling operation extracts a combination of features which are invariant to translational shifts and distortions :cite:p:`Ranzato2007,Scherer2010`. The reduction of the feature map to an invariant feature set not only reduces network complexity but also increases generalization by reducing overfitting. The most common types of pooling formulations are :cite:p:`He2015,Wang2012`:

- Max pooling.
- Average pooling.
- L2 pooling.
- Overlapping pooling.
- Spatial pyramid pooling.

.. _sec:cnn_blocks_activation_layer:

Activation Function
~~~~~~~~~~~~~~~~~~~

In classification problems, activation functions are used as a decision function, helping to differentiate complex classes. The selection of the activation function can also accelerate the learning process. For CNNs, the activation of a convolved feature map can be defined as:

.. math:: T_{l}^{k} = f_{A} ( C_{l}^{k} )
   \label{eq:activation_layer}

In equation `[eq:activation_layer] <#eq:activation_layer>`__, :math:`C_{l}^{k}` is the output of a convolution operation, which is mapped to an activation function :math:`f_{A}(x)`. This activation function adds non-linearity and returns the resulting output :math:`T_{l}^{k}` for the :math:`k^{th}` layer. In academia, different activation functions such as sigmoid, tanh, maxout, ReLU, and variants of ReLU such as leaky ReLU, ELU, and PReLU :cite:p:`Wang2017,Wang2012,xu2015empirical,LeCun2012` are used to inculcate non-linear combinations of features. However, ReLU and its variants are preferred over other activations, as they help in overcoming the vanishing gradient problem :cite:p:`nwankpa2018activation,Hochreiter1998`. Many improvements to the learning process were only possible due to research on new activation functions. Backpropagation uses the derivatives of these functions, so it is also important to have a clear idea of each activation function's derivative, because backpropagation is a leaky abstraction (it might use a credit assignment scheme with non-trivial consequences).

Linear
~~~~~~

The linear activation function, as described in table `1.2 <#table:linear>`__, is the most basic activation function. It can be seen as a straight-line function where the activation is proportional to the input (which is the weighted sum from the neuron). For the derivative graph, a value of :math:`m=1` was considered.

.. container::
   :name: table:linear

   .. table:: Activation Function Linear summary.
      +-----------------+------------------------------+----------------------------------------+
      |                 | **Function**                 | **Derivative**                         |
      +=================+==============================+========================================+
      | **Formula**     | :math:`R(z,m) = z*m`         | :math:`R'(z,m) = m`                    |
      +-----------------+------------------------------+----------------------------------------+
      | **Python code** | .. code:: python             | .. code:: python                       |
      |                 |                              |                                        |
      |                 |    def linear(z, m):         |    def linear_der(z, m):               |
      |                 |        return m * z          |        # derivative is the constant m  |
      |                 |                              |        return m * np.ones_like(z)      |
      +-----------------+------------------------------+----------------------------------------+

The main advantages of using a linear activation function can be described as:

- It gives a linear value for a range of activations, which can be used in both regression and classification.
- It is possible to utilize multiple neurons together and do simple classifications afterwards, such as considering the maximum value fired.

On the other hand, the linear activation function has some disadvantages, such as:

- The derivative is a constant. This has a negative impact on backpropagation, because the gradient has no relationship with :math:`x`.
- It is not possible to utilize gradient descent for learning, because the gradient is constant.

.. _sec:relu:

ReLU
~~~~

ReLU is the most used activation function in today's applications, mainly because its formula is deceptively simple: :math:`max(0,z)`. Despite its name and appearance, it is not linear, and it provides the same benefits as the traditional sigmoid but with better performance due to its computational simplicity. This activation function is summarized in table `1.3 <#table:relu>`__.

.. container::
   :name: table:relu

   .. table:: Activation Function ReLU summary.

      +-----------------+----------------------------------------------------------------------+----------------------------------------------------------------------+
      |                 | **Function**                                                         | **Derivative**                                                       |
      +=================+======================================================================+======================================================================+
      | **Formula**     | :math:`R(z) = \begin{Bmatrix} z & z > 0 \\ 0 & z <= 0 \end{Bmatrix}` | :math:`R'(z) = \begin{Bmatrix} 1 & z > 0 \\ 0 & z < 0 \end{Bmatrix}` |
      +-----------------+----------------------------------------------------------------------+----------------------------------------------------------------------+
      | **Python code** | .. code:: python                                                     | .. code:: python                                                     |
      |                 |                                                                      |                                                                      |
      |                 |    def relu(z):                                                      |    def relu_der(z):                                                  |
      |                 |        return np.where(z >= 0, z, 0)                                 |        return np.where(z >= 0, 1, 0)                                 |
      +-----------------+----------------------------------------------------------------------+----------------------------------------------------------------------+

The advantages of using ReLU are quite trivial to understand, but it was a big breakthrough for CNNs :cite:p:`Wang2017,Wang2012,xu2015empirical`. The main ones can be considered to be:

- It avoids and rectifies the vanishing gradient problem that was present in the antecedent activation functions.
- It is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations.

Due to its popularity, several researchers have detected some disadvantages in this technique and have proposed alternative versions :cite:p:`Wang2017,Wang2012,xu2015empirical`. Some of these disadvantages are:

- The range of ReLU is :math:`[0, \infty)`. This means it has no upper bound, which makes the classification problem harder and can cause the CNN to overshoot.
- It should only be used within the hidden layers of a neural network model.
  There is no advantage in cropping the negative values in the output layer.
- Some gradients can be fragile during training and get discarded. Usually, when this happens, the neuron has updated its weights to values that produce negative :math:`x` results. When these values are passed to the ReLU it always returns 0 and, due to its derivative, the neuron is never updated again, which can be considered a 'dead neuron'.
- In other words, for activations in the region :math:`x<0` of ReLU, the gradient will be 0, because of which the weights will not get adjusted during descent. That means that those neurons which go into that state will stop responding to variations in error/input (simply because the gradient is 0, nothing changes). This is called the dying ReLU problem. Some studies conducted on SoA CNNs found that, in many architectures, more than 90% of the network is composed of 'dead neurons' :cite:p:`NIPS2015_5784`.

ELU
~~~

The ELU activation function usually makes the cost converge towards zero in fewer epochs, which generates faster training and produces more accurate results. Differently from other activation functions, ELU uses an :math:`\alpha` constant, which needs to be a positive number. As analyzed in section `1.3.5 <#sec:relu>`__, ELU is similar to ReLU, except in the negative input region. Both functions are the identity function for positive inputs. On the other hand, ELU becomes smooth slowly until its output equals :math:`-\alpha`, whereas ReLU drops to 0 instantaneously, as presented in table `1.4 <#table:elu>`__.

.. container::
   :name: table:elu

   .. table:: Activation Function ELU summary.

      +-----------------+-----------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+
      |                 | **Function**                                                                            | **Derivative**                                                                     |
      +=================+=========================================================================================+====================================================================================+
      | **Formula**     | :math:`R(z) = \begin{Bmatrix} z & z > 0 \\ \alpha * (e^{z} - 1) & z <= 0 \end{Bmatrix}` | :math:`R'(z) = \begin{Bmatrix} 1 & z > 0 \\ \alpha * e^{z} & z <= 0 \end{Bmatrix}` |
      +-----------------+-----------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+
      | **Python code** | .. code:: python                                                                        | .. code:: python                                                                   |
      |                 |                                                                                         |                                                                                    |
      |                 |    def elu(z, alpha):                                                                   |    def elu_der(z, alpha):                                                          |
      |                 |        return np.where(z >= 0, z, alpha * (np.exp(z) - 1))                              |        return np.where(z >= 0, 1, alpha * np.exp(z))                               |
      +-----------------+-----------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+

Some of the benefits of ELU are :cite:p:`xu2015empirical`:

- ELU becomes smooth slowly until its output equals :math:`-\alpha`, whereas ReLU drops sharply to 0.
- ELU is a strong alternative to ReLU.
- Unlike ReLU, ELU can produce negative outputs.

Nonetheless, for :math:`x > 0`, the ELU activation function can also overshoot, since its output range is :math:`[0, \infty)` :cite:p:`xu2015empirical`.

LeakyReLU
~~~~~~~~~

LeakyReLU is yet another variant of ReLU. Instead of being 0 when :math:`z < 0`, a leaky ReLU allows a small, non-zero, constant gradient :math:`\alpha` (usually the value :math:`\alpha = 0.01` is considered).
However, the consistency of the benefit across tasks is presently unclear. This activation function is summarized in table `1.5 <#table:leakyrelu>`__. Even though usually the value :math:`\alpha = 0.01` is considered, for a better graphical representation of the 'leaking' property, a value of :math:`\alpha = 0.1` has been considered.

.. container::
   :name: table:leakyrelu

   .. table:: Activation Function LeakyReLU summary.

      +-----------------+-------------------------------------------------------------------------------+---------------------------------------------------------------------------+
      |                 | **Function**                                                                  | **Derivative**                                                            |
      +=================+===============================================================================+===========================================================================+
      | **Formula**     | :math:`R(z) = \begin{Bmatrix} z & z > 0 \\ \alpha * z & z <= 0 \end{Bmatrix}` | :math:`R'(z) = \begin{Bmatrix} 1 & z > 0 \\ \alpha & z < 0 \end{Bmatrix}` |
      +-----------------+-------------------------------------------------------------------------------+---------------------------------------------------------------------------+
      | **Python code** | .. code:: python                                                              | .. code:: python                                                          |
      |                 |                                                                               |                                                                           |
      |                 |    def leakyrelu(z, alpha):                                                   |    def leakyrelu_der(z, alpha):                                           |
      |                 |        return np.where(z >= 0, z, alpha * z)                                  |        return np.where(z >= 0, 1, alpha)                                  |
      +-----------------+-------------------------------------------------------------------------------+---------------------------------------------------------------------------+

Leaky ReLUs are a clear attempt to fix the 'dead neuron' problem of ReLU. By having a small negative slope, the gradient always has an opportunity to train the network and the possibility to tweak the weights so as to bring the input back into the positive :math:`x` region. Nevertheless, it possesses linearity, so it should not be used for classification tasks :cite:p:`Wang2017,Wang2012`.

Sigmoid
~~~~~~~

The sigmoid activation function receives as input a real value and outputs a value between 0 and 1, as can be extrapolated from table `1.6 <#table:sigmoid>`__. This makes it easy to apply, because it contains the most desired properties for an activation function: it is non-linear, continuously differentiable, monotonic, and has a well-defined output range :cite:p:`LeCun2012`.

.. container::
   :name: table:sigmoid

   .. table:: Activation Function Sigmoid summary.

      +-----------------+---------------------------------------+----------------------------------------------+
      |                 | **Function**                          | **Derivative**                               |
      +=================+=======================================+==============================================+
      | **Formula**     | :math:`S(z) = \frac{1}{1 + e^{-z}}`   | :math:`S'(z) = S(z) \cdot (1 - S(z))`        |
      +-----------------+---------------------------------------+----------------------------------------------+
      | **Python code** | .. code:: python                      | .. code:: python                             |
      |                 |                                       |                                              |
      |                 |    def sigmoid(z):                    |    def sigmoid_der(z):                       |
      |                 |        return 1.0 / (1 + np.exp(-z))  |        return sigmoid(z) * (1 - sigmoid(z))  |
      +-----------------+---------------------------------------+----------------------------------------------+

The main advantages of the sigmoid function can be described as :cite:p:`LeCun2012`:

- It is a non-linear function which, if combined multiple times, represents a complex output space more easily than a linear function.
- It produces an analog activation, with a step-function resemblance.
- It has a smooth gradient.
- The step-like shape gives good results in classification applications.
- The output of the activation function is always going to be in the range :math:`[0,1]`, compared to :math:`(-\infty, \infty)` for linear-like functions. This prevents the output from overshooting.

On the other hand, some disadvantages have been identified by researchers :cite:p:`LeCun2012`, in concrete:

- At the extremes of the sigmoid function, the output :math:`y` barely responds to fluctuations in the input :math:`x`.
- Also, at the extremes, the gradient tends to 0, which generates the problem of vanishing gradients :cite:p:`Hochreiter1998`.
- Optimization is not trivial, because the output is not zero-centered. In the region around zero the gradient is higher, which makes the updates flow in different directions.
- Random weight initialization can make the network learn drastically slowly, or even refuse to learn.

Tanh
~~~~

The tanh activation function is similar to the sigmoid, but with the output zero-centered, as presented in table `1.7 <#table:tanh>`__. Usually tanh is preferred over sigmoid, due to its zero-centered output :cite:p:`Kalman2003, xu2016revise`.

.. container::
   :name: table:tanh

   .. table:: Activation Function Tanh summary.

      +-----------------+--------------------------------------------------------------------+-----------------------------------------+
      |                 | **Function**                                                       | **Derivative**                          |
      +=================+====================================================================+=========================================+
      | **Formula**     | :math:`T(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}`               | :math:`T'(z) = 1 - T(z)^{2}`            |
      +-----------------+--------------------------------------------------------------------+-----------------------------------------+
      | **Python code** | .. code:: python                                                   | .. code:: python                        |
      |                 |                                                                    |                                         |
      |                 |    def tanh(z):                                                    |    def tanh_der(z):                     |
      |                 |        return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))  |        return 1 - np.power(tanh(z), 2)  |
      +-----------------+--------------------------------------------------------------------+-----------------------------------------+

Kalman calculated a function based on tanh :cite:p:`Kalman2003`. In his study, he concluded that for deep networks the gradient is stronger for tanh than for sigmoid (due to the derivatives being steeper), which leads to faster convergence. Nonetheless, neither sigmoid nor tanh addresses the vanishing gradient problem.

.. _sec:softmax:

Softmax
~~~~~~~

Finally, the last activation function this documentation will look into is the softmax. It calculates the probability distribution of an event over :math:`N` different events, where :math:`N` is the size of the output array. This function is quite different from the previous ones in its conception. The probability of each target class over all possible target classes is calculated using an :math:`N`-dimensional vector of arbitrary real values, producing another :math:`N`-dimensional vector with real values in the range :math:`[0, 1]` that add up to :math:`1.0`. This is demonstrated in equation `[eq:softmaxvectors] <#eq:softmaxvectors>`__.

.. math::

   \label{eq:softmaxvectors}
   S(z) : \begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_N \end{bmatrix} \rightarrow \begin{bmatrix} S_1 \\ S_2 \\ \vdots \\ S_N \end{bmatrix}

Since these outputs are already probabilities in :math:`[0,1]`, the results can be directly mapped to target classes, and the training not only maximizes a value, but also maximizes the disparity between the triggers.
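A quick numerical illustration of this mapping; the three logit values below are arbitrary, and the max-subtraction is a common numerical-stability variant of the formula summarized later in table `1.8 <#table:softmax>`__.

.. code:: python

   import numpy as np

   def softmax(z):
       # Subtracting the maximum is a common numerical-stability trick.
       e = np.exp(z - np.max(z))
       return e / np.sum(e)

   logits = np.array([2.0, 1.0, 0.1])   # arbitrary raw scores for 3 classes
   probs = softmax(logits)
   print(probs)            # ~[0.659 0.242 0.099]
   print(probs.sum())      # ~1.0 -> directly interpretable as class probabilities
   print(probs.argmax())   # 0   -> predicted target class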
From a mathematical point of view, whereas the previous functions could be analyzed from a scalar point of view, softmax is fundamentally a vector function. It takes a vector as input and produces a vector as output; in other words, it has multiple inputs and multiple outputs. Therefore, it is not possible to speak of 'the derivative of softmax' as a single scalar. For this reason, this activation function is presented in more detail. Since softmax has multiple inputs, it must be specified with respect to which input element the partial derivative is computed. Thus, it is necessary to find the partial derivatives:

.. math::

   \label{eq:softmaxder}
   \frac{\partial S_i}{\partial z_j}

This is the partial derivative of the :math:`i^{th}` output with respect to the :math:`j^{th}` input. A shorter way to write the partial derivative, used going forward, is :math:`D_j S_i`. Since softmax is a :math:`\mathbb{R}^N \rightarrow \mathbb{R}^N` function, the most general derivative computed for it is the Jacobian matrix:

.. math::

   \label{eq:softmaxjaco}
   DS = \begin{bmatrix} D_1 S_1 & \cdots & D_N S_1 \\ \vdots & \ddots & \vdots \\ D_1 S_N & \cdots & D_N S_N \end{bmatrix}

Computing it for arbitrary :math:`i` and :math:`j`:

.. math::

   \label{eq:softmaxds}
   D_j S_i = \frac{\partial S_i}{\partial z_j} = \frac{\partial \frac{e^{z_i}}{\sum_{k=1}^{N} e^{z_k}}}{\partial z_j}

Note that no matter which :math:`z_j` the derivative is taken with respect to, the derivative of the denominator, :math:`\sum_{k=1}^{N} e^{z_k}`, will always yield :math:`e^{z_j}`. This is not the case for the numerator :math:`e^{z_i}`. The derivative of :math:`e^{z_i}` with respect to :math:`z_j` is :math:`e^{z_j}` only if :math:`i = j`, because only then does :math:`e^{z_i}` contain :math:`z_j`; otherwise, the derivative is :math:`0`. As a result, :math:`D_j S_i` can be calculated by:

.. math::

   \label{eq:softmaxfder}
   D_j S_i = \begin{Bmatrix} S_i (1 - S_j) & i = j \\ - S_i S_j & i \neq j \end{Bmatrix}

In the ML literature, the term gradient is commonly used to stand in for the derivative. Gradients are only defined for scalar functions (such as the functions described in the previous sections); for vector functions like softmax it is imprecise to speak of a gradient, and the Jacobian is the fully general derivative of a vector function. Nevertheless, for the sake of coherence, a summary table `1.8 <#table:softmax>`__ is presented. Keep in mind that, for both the graphical representation and the direct code derivative in Python, the results are of limited significance: since softmax is a vector function, the resulting value :math:`S(z)` for a given :math:`z` is highly dependent on the size :math:`N` of the :math:`z` array (for this representation, :math:`60` points in :math:`[-6,6]` were considered). In practical applications the Jacobian matrix is calculated, and for graphical representation, the probability of the target class is usually preferred.

.. container::
   :name: table:softmax

   .. table:: Activation Function Softmax summary.
      +-----------------+----------------------------------------------------------+--------------------------------------------------------------------------------------------+
      |                 | **Function**                                             | **Derivative**                                                                             |
      +=================+==========================================================+============================================================================================+
      | **Formula**     | :math:`S(z_i) = \frac{e^{z_i}}{\sum_{k=1}^{N} e^{z_k}}`  | :math:`S'(z_i) = \begin{cases} S_i (1 - S_j) & i = j \\ - S_i S_j & i \neq j \end{cases}`  |
      +-----------------+----------------------------------------------------------+--------------------------------------------------------------------------------------------+
      | **Python code** | .. code:: python                                         | .. code:: python                                                                           |
      |                 |                                                          |                                                                                            |
      |                 |    def softmax(x):                                       |    def softmax_der(x):                                                                     |
      |                 |        return np.exp(x) / np.sum(np.exp(x), axis=0)      |        sm = softmax(x)                                                                     |
      |                 |                                                          |        return sm * (1 - sm)                                                                |
      +-----------------+----------------------------------------------------------+--------------------------------------------------------------------------------------------+

The basic practical difference between the sigmoid and softmax is that, while both produce outputs in the :math:`[0,1]` range, softmax ensures that the sum of the outputs along the specified dimension (channels) is always :math:`1`, which allows them to be directly interpreted as class probability estimates. The sigmoid merely squashes each output into :math:`[0,1]` independently. Hence, if a one-hot encoding scheme is used, where one channel holds the probability of one class and another channel the probability of another class, the softmax activation is preferred.

.. _sec:cnn_blocks_batchnorm_layer:

Batch Normalization
~~~~~~~~~~~~~~~~~~~

Batch normalization is used to address the issues related to internal covariate shift within feature maps. Internal covariate shift is a change in the distribution of the hidden units' values, which slows down convergence (by forcing the learning rate to a small value) and requires a careful initialization of the parameters. Batch normalization of an input feature map :math:`C^{k}_{l}` into a normalized feature map :math:`N^{k}_{l}` can be represented as:

.. math::

   \label{eq:normalization_layer}
   N_{l}^{k} = \frac{C_{l}^{k} - \mu_{B}}{\sqrt{\sigma_{B}^{2} + \varepsilon}}

In equation `[eq:normalization_layer] <#eq:normalization_layer>`__, :math:`N^{k}_{l}` represents the normalized feature map, :math:`C^{k}_{l}` is the input feature map, :math:`\mu_{B}` and :math:`\sigma_{B}^{2}` denote the mean and variance of the feature map over a mini-batch, respectively, and :math:`\varepsilon` is a small constant added for numerical stability. Batch normalization unifies the distribution of the feature map values by bringing them to zero mean and unit variance :cite:p:`Ioffe2015`. Furthermore, it smooths the flow of gradients and acts as a regularizing factor, which helps to improve the generalization of the network.

.. _sec:cnn_blocks_dropout_layer:

Dropout
~~~~~~~

The dropout technique introduces regularization into the network, which ultimately reduces overfitting by randomly skipping some units or connections with a certain probability. In DNNs, multiple connections that learn a non-linear relation sometimes become co-adapted, which reduces generalization :cite:p:`hinton2012improving`. Randomly dropping some connections or units forces all neurons to be utilized: many thinned network architectures are trained, and in the end a single representative network with all the weights is kept. This selected architecture is then considered an approximation of all of the proposed thinned networks :cite:p:`Srivastava2014`.
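As a minimal sketch of this mechanism (assuming NumPy and the common "inverted dropout" formulation, which is one possible implementation rather than the exact one used by any particular framework), the code below randomly zeroes units with probability :math:`p` during training and rescales the survivors, so that no rescaling is needed at inference time:

.. code:: python

   import numpy as np

   def dropout(activations, p=0.5, training=True, rng=np.random.default_rng(0)):
       """Inverted dropout: zero each unit with probability p during training
       and rescale the survivors by 1 / (1 - p)."""
       if not training or p == 0.0:
           return activations                      # full network used as-is
       mask = rng.random(activations.shape) >= p   # keep with probability 1 - p
       return activations * mask / (1.0 - p)

   h = np.array([0.2, 1.5, 0.7, 2.3])
   print(dropout(h, p=0.5))            # a randomly "thinned" version of h
   print(dropout(h, training=False))   # unchanged at inference time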
.. _sec:cnn_blocks_fullycon_layer:

Fully Connected Layer
~~~~~~~~~~~~~~~~~~~~~

Fully connected layers are used at the end of the network for classification or regression purposes. They take the input from the previous layer and globally analyze the output of all the preceding layers :cite:p:`Lin2014nn`. This creates a non-linear combination of the selected features, which is used for the classification of the data :cite:p:`Rawat2017`. Because this process connects every input value to every output unit, the number of operations and weights involved usually surpasses that of the rest of the entire network.

.. _sec:dldataaug:

Data Augmentation
-----------------

Data augmentation is an effective technique for improving the accuracy of CNNs :cite:p:`Shorten2019`. Usually, data augmentation uses transformations such as flipping, color space augmentations, and random cropping. These transformations encode many of the invariances that present challenges to image recognition tasks. Some more advanced data augmentation techniques are GAN-based augmentation, neural style transfer, and meta-learning schemes :cite:p:`konnoicing2018,DeVries2019`. This section explains how the common augmentation algorithms work, illustrates experimental results, and discusses the disadvantages of the augmentation techniques.

Some frameworks such as Keras :cite:p:`Keras2019` provide ways to perform data augmentation on the fly, rather than performing the operations on the entire image dataset in memory. The API is designed to be iterated over by the deep learning model training process, creating augmented image data for the algorithm at run time. This reduces the memory overhead, but adds some computation during model training, which results in a longer training time. In Keras, the IDG calculates the statistics required to actually perform the transforms on the image data. The data generator itself is in fact an iterator, returning batches of image samples when requested. In the most commonly used ML frameworks, when data augmentation is applied, instead of calling the fit function on the model it is necessary to call the fit generator function and pass in an IDG, together with the desired length of an epoch and the total number of epochs for which to train.

The MNIST dataset :cite:p:`LeCun1998` was used in order to have a common set of example images. In figure `1.12 <#fig:pointsofcompare>`__, a set of nine images is shown to serve as a baseline for comparing the image augmentation algorithms.

.. figure:: ../_static/images/learning/PointOfCompare.webp
   :alt: Point Of Comparison

   Point Of Comparison.

Feature Standardization
~~~~~~~~~~~~~~~~~~~~~~~

Standardization typically means rescaling the data so that it has a mean of :math:`\mu = 0` and a standard deviation of :math:`\sigma = 1` (unit variance). Feature standardization normalizes pixel values across an entire dataset. It mirrors the type of standardization often performed for each column of a tabular dataset :cite:p:`Shen2016a`. Usually this is done by applying equation `[eq:features] <#eq:features>`__:

.. math::

   \label{eq:features}
   x' = \frac{x - \bar{x}}{\sigma}

In the Keras framework, this is achieved by setting the feature-wise center and feature-wise standard normalization arguments of the IDG class.
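A minimal sketch of this setup is shown below. It assumes TensorFlow's Keras API (the ``ImageDataGenerator`` class and the MNIST loader bundled with Keras) and is intended only to illustrate the feature-wise arguments, not to serve as a full training pipeline:

.. code:: python

   from tensorflow.keras.datasets import mnist
   from tensorflow.keras.preprocessing.image import ImageDataGenerator

   # Load MNIST and add the single channel dimension the generator expects.
   (x_train, y_train), _ = mnist.load_data()
   x_train = x_train.reshape((-1, 28, 28, 1)).astype("float32")

   # Feature-wise standardization: subtract the dataset mean and divide by the
   # dataset standard deviation, as in equation [eq:features].
   datagen = ImageDataGenerator(featurewise_center=True,
                                featurewise_std_normalization=True)
   datagen.fit(x_train)  # computes the statistics over the training set

   # The generator is an iterator returning standardized batches at run time.
   x_batch, y_batch = next(datagen.flow(x_train, y_train, batch_size=9))
   print(x_batch.mean(), x_batch.std())  # roughly 0 and 1

The standardized batches would then be fed to the model through the fit generator mechanism mentioned above.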
Applying feature standardization to the images of figure `1.12 <#fig:pointsofcompare>`__ produces the result represented in figure `1.13 <#fig:data_aug_feature_std>`__, in which the images appear to darken and lighten the different digits.

.. figure:: ../_static/images/learning/StandardizeImages.webp
   :alt: Feature Standardization

   Feature Standardization.

ZCA Whitening
~~~~~~~~~~~~~

A whitening transform of an image is a linear algebra operation that reduces the redundancy in the matrix of pixel values. Less redundancy in the image is intended to better highlight the structures and features of the image to the learning algorithm :cite:p:`Li2015a`. Considering :math:`N` data points in :math:`\mathbb{R}^n`, the covariance matrix :math:`\Sigma \in \mathbb{R}^{n \times n}` is estimated as:

.. math::

   \label{eq:zcawhiteone}
   \hat{\Sigma}_{jk} = \frac{1}{N-1} \sum_{i=1}^N (x_{ij} - \bar{x}_j) \cdot (x_{ik} - \bar{x}_k)

In equation `[eq:zcawhiteone] <#eq:zcawhiteone>`__, :math:`\bar{x}_j` denotes the :math:`j^{th}` component of the estimated mean of the samples :math:`x`. Any matrix :math:`W \in \mathbb{R}^{n \times n}` which satisfies the condition :math:`W^T W = \hat{\Sigma}^{-1}` whitens the data. Typically, image whitening is performed using the PCA technique. More recently, an alternative called ZCA has shown better results: the transformed images keep all of the original dimensions and, unlike with PCA, they still look like their originals. To execute a ZCA transform, the whitening matrix :math:`W = \hat{\Sigma}^{- \frac{1}{2}}` is used. Using a ZCA whitening transform on the sample images, the same general structure is maintained and the outline of each digit is highlighted, as illustrated in figure `1.14 <#fig:zcawhitening>`__.

.. figure:: ../_static/images/learning/ZCAWhitening.webp
   :alt: ZCA Whitening

   ZCA Whitening.

Random Shifts
~~~~~~~~~~~~~

Objects in images may not be centered in the frame; they may be off-center in a variety of different ways. To deal with this during training, a common technique is to train the deep learning network to expect and handle off-center objects by artificially creating shifted versions of the training data. For example, Keras and TensorFlow support separate horizontal and vertical random shifting of the training data through the width shift range and height shift range arguments. Running this augmentation creates shifted versions of the digits, as represented in figure `1.15 <#fig:randomshifts>`__. Again, this is not required for MNIST, as the handwritten digits are already centered, but it is useful in more complex problem domains.

.. figure:: ../_static/images/learning/RandomShifts.webp
   :alt: Random Shifts

   Random Shifts.

Random Flips
~~~~~~~~~~~~

Another image data augmentation technique that can improve performance is to randomly flip the training images. Figure `1.16 <#fig:randomflips>`__ shows its result on the sample images. In this example (the MNIST dataset), flipping digits is not useful, as they require the correct left and right orientation, but it may be useful for images of objects in a scene that can appear in different orientations.

.. figure:: ../_static/images/learning/RandomFlips.webp
   :alt: Random Flips

   Random Flips.

Random Rotations
~~~~~~~~~~~~~~~~

Sometimes images in the dataset may have different rotations in the scene. In those cases, it is helpful to make the model capable of handling image rotations by artificially and randomly rotating images from the dataset during training.
As seen in figure `1.17 <#fig:randomrotations>`__, the images have been rotated left and right up to a limit of 180 degrees. This is not helpful for this problem, because the MNIST digits have a normalized orientation, but the transform might help when learning from photographs where the objects may appear in different orientations. Not only that, but it might also lead to some incorrect labeling: for example, the digit :math:`9` in the top right corner is transformed into a :math:`6` but remains labeled as a :math:`9`, possibly leading to a worse model.

.. figure:: ../_static/images/learning/RandomRotation.webp
   :alt: Random Rotations

   Random Rotations.

Additional Augmentations
~~~~~~~~~~~~~~~~~~~~~~~~

When doing run-time data augmentation, it is important not to stack multiple techniques without a clear idea of the augmented results. As an example, figure `1.18 <#fig:additionalaugmentations>`__ shows the outcome of applying random shifts, ZCA whitening, standard normalization, random flips and zoom (between :math:`\left [ \frac{1}{2}, 2 \right ]`) together. It is questionable whether the data represented after augmentation is still valid, or whether it requires the model to learn that the number :math:`2` is a black square (bottom right image).

.. figure:: ../_static/images/learning/WeirdAugImages.webp
   :alt: Data Augmentation done wrong

   Data Augmentation done wrong.

Additionally, some common data augmentation options are rescaling and the filling mode. Both of these are usually applied after the rest of the data augmentation techniques. The filling mode can have different flavors: points outside the boundaries of the input are filled according to the given mode (a combined configuration sketch using these options is shown at the end of this section):

- Constant: the outside is filled with a predefined value :math:`k`.

  .. math:: \begin{bmatrix}kkkk \left | abcd \right | kkkk \end{bmatrix}

- Nearest: the outside is filled with the value of the nearest edge pixel.

  .. math:: \begin{bmatrix}aaaa \left | abcd \right | dddd \end{bmatrix}

- Reflect: the outside is filled with a reflection of the values, sometimes called mirror filling.

  .. math:: \begin{bmatrix}dcba \left | abcd \right | dcba \end{bmatrix}

- Wrap: the outside is filled with values from the opposite edge, as if the image were a cylinder and the content wrapped around.

  .. math:: \begin{bmatrix}abcd \left | abcd \right | abcd \end{bmatrix}

Image data is unique in the sense that it is possible to review the data, create transformed copies and quickly get an idea of how the dataset may be perceived by the model. Training DNNs comes with experience, and the quality of the results is tightly linked to the tweaks done to the data. For that reason, to conclude this page, some tips for getting the most out of image data preparation and augmentation for DL are summarized:

- Review the dataset and work with it before starting to train models. In most cases, only a few augmentations actually benefit the training process of the model, such as those needed to handle the different shifts, rotations or flips of objects in the scene.

- Inspect the augmentations. It is one thing to know intellectually which image transforms to use; in practice, looking at example results is very different. Reviewing images both with individual augmentations and with the full set of planned augmentations may reveal ways to simplify or further enhance the model training process.

- Lastly, it is important to evaluate a suite of transforms by trying more than one image data preparation and augmentation scheme. Often the results of a data preparation scheme turn out different from what was initially envisioned, and the data augmentations end up not being beneficial.
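The sketch referenced above combines the shift, rotation, flip and fill-mode options from this section into a single generator. It assumes TensorFlow's Keras ``ImageDataGenerator`` and the MNIST arrays prepared as in the feature standardization example; the specific argument values are illustrative only and should be tuned to the dataset at hand:

.. code:: python

   from tensorflow.keras.datasets import mnist
   from tensorflow.keras.preprocessing.image import ImageDataGenerator

   (x_train, y_train), _ = mnist.load_data()
   x_train = x_train.reshape((-1, 28, 28, 1)).astype("float32")

   # Illustrative values only: on MNIST, flips and large rotations are harmful,
   # as discussed above; pick ranges that reflect the variability of the data.
   datagen = ImageDataGenerator(
       rescale=1.0 / 255.0,     # rescaling, applied to every image
       width_shift_range=0.1,   # random horizontal shifts (fraction of width)
       height_shift_range=0.1,  # random vertical shifts (fraction of height)
       rotation_range=15,       # random rotations up to 15 degrees
       horizontal_flip=False,   # digits need their left/right orientation
       fill_mode="nearest",     # how pixels outside the boundaries are filled
   )

   # Inspect the augmentations before training: draw one batch and look at it.
   x_batch, y_batch = next(datagen.flow(x_train, y_train, batch_size=9))
   print(x_batch.shape, y_batch[:9])

Drawing and inspecting a batch before training, as in the last lines, is exactly the kind of review recommended in the tips above.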
Bibliography
------------

.. bibliography::