\chapter{Introduction}
\label{chapter:intro}
\emph{\acrfull{asr}} is the process of converting speech into text. \acrshort{asr} is used in a wide range of applications, such as digital assistants and the captioning of audio conversations. \acrshort{asr} systems transcribe digital speech audio using pattern matching techniques: features computed from the audio are matched against models trained from data. Research in this field has become very dynamic over the past decade with the application of deep neural networks to \acrshort{asr}. A \acrfull{dnn} is a machine learning model consisting of multiple layers of computational units that learns to produce the required output for a given input. Applying deep neural networks to \acrshort{asr} requires speech audio data along with transcripts; these are fed to the network so that it learns from the audio data, a process called model training. State-of-the-art \acrshort{asr} systems that use deep learning require large amounts of training data.
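
To make the training process described above concrete, the sketch below shows a single training step in PyTorch for a toy acoustic model trained with the connectionist temporal classification (CTC) loss, a common objective for end-to-end \acrshort{asr}. This is an illustrative example only: the model, feature dimensions, and data are hypothetical placeholders and do not correspond to the system built in this thesis.

\begin{verbatim}
import torch
import torch.nn as nn

# Toy acoustic model: maps 80-dimensional feature frames to scores
# over 29 output symbols (a CTC blank plus 28 characters).
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 29))
ctc_loss = nn.CTCLoss(blank=0)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One synthetic batch: 4 utterances of 100 frames each, with
# reference transcripts of 20 labels each (symbols 1..28).
features = torch.randn(4, 100, 80)
targets = torch.randint(1, 29, (4, 20))

# CTC expects log-probabilities of shape (time, batch, symbols).
log_probs = model(features).log_softmax(-1).transpose(0, 1)
loss = ctc_loss(log_probs, targets,
                input_lengths=torch.full((4,), 100),
                target_lengths=torch.full((4,), 20))

loss.backward()       # compute gradients from the CTC loss
optimizer.step()      # update the model parameters
optimizer.zero_grad()
\end{verbatim}
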
Increasing the amount of data used to train deep neural network models exposes them to more diverse data, which has previously led to substantial increases in accuracy. Moreover, given the decreasing costs of data storage and networking~\cite{Sayed2014ASecurity}, scaling training up to ever larger amounts of data has become a notable trend in recent years. Popular cloud-based solutions offered by large enterprises use the vast resources available to them to train deep learning models that can process speech audio with high accuracy~\cite{Li2020OnRecognition}. Such large-scale experiments are harder to conduct for smaller companies and academic institutions, because the large amounts of storage and processing power required for training are not easily available to them. Hence, there is a need to optimize the training process in order to reduce the storage footprint and the computational resources required to train large-scale deep learning models.

Training jobs on large datasets generally use distributed training, in which multiple processing units operate on data stored in a network location. This introduces new problems, such as optimizing input/output read performance and managing local storage; these are infrastructure and engineering concerns. Beyond these, there are also research problems, including convergence issues when stochastic gradient descent is used in a distributed training setup.
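
As an illustration of one common engineering approach, the sketch below shows synchronous data-parallel training with PyTorch's DistributedDataParallel: each worker trains on a disjoint shard of the dataset, and gradients are averaged across the workers during the backward pass. The model, data, and launch command are hypothetical placeholders, not the setup used in this thesis.

\begin{verbatim}
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import (DataLoader, TensorDataset,
                              DistributedSampler)

def main():
    # Launched with e.g. `torchrun --nproc_per_node=4 train.py`;
    # torchrun sets LOCAL_RANK, RANK and WORLD_SIZE for each worker.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model standing in for an acoustic model.
    model = DDP(nn.Linear(80, 29).cuda(local_rank),
                device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # Synthetic data standing in for features on network storage.
    dataset = TensorDataset(torch.randn(1024, 80),
                            torch.randn(1024, 29))
    sampler = DistributedSampler(dataset)  # one shard per worker
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    for x, y in loader:
        out = model(x.cuda(local_rank))
        loss = nn.functional.mse_loss(out, y.cuda(local_rank))
        optimizer.zero_grad()
        loss.backward()   # gradients are all-reduced across workers
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
\end{verbatim}
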
\section{Problem statement}
\label{section:rq}
The purpose of this thesis is to explore the task of large-scale speech recognition and to optimize the training jobs, thereby enabling wider adoption of scaled-up datasets for training end-to-end \acrshort{asr} models. The aims of the thesis are therefore as follows:
\begin{enumerate}
\item How can large-scale training tasks be enabled from an engineering point of view? What are the best methods to store and load data for such training jobs?
\item What is the effect of training data scale on the quality of the acoustic model used?
\begin{enumerate}
\item Does increasing the scale of training data affect the convergence time of the acoustic model?
\end{enumerate}
\item How effective are multiple \acrshort{gpu}s and multiple processing jobs at accelerating training and reaching convergence more quickly?
\begin{enumerate}
\item Which of the synchronous and asynchronous training methods is more suitable for our system setup?
\item What is the speed-up achieved by using these methods to train an end-to-end acoustic model?
\end{enumerate}
\end{enumerate}
See Section \ref{section:ans} for direct answers to these research questions.
\section{Structure of the Thesis}
\label{section:structure}
Chapter \ref{chapter:background} discusses the most essential concepts in deep learning and the network architectures used for the end-to-end \acrshort{asr} model. Chapter \ref{chapter:largescale} focuses on large-scale speech recognition and its importance, and discusses the datasets available for it. Chapter \ref{chapter:methods} explains the dataset used and the techniques used to scale up the speech recognition task. Next, we delve into the experiments conducted and present the results in Chapter \ref{chapter:evaluation}. Chapter \ref{chapter:discussion} answers the research questions and discusses future work. Finally, Chapter \ref{chapter:conclusions} concludes the thesis with the most important takeaways from the work.