Machine learning is a set of algorithms designed to take in large amounts of data and recognize patterns; one prominent family, neural networks, is loosely modeled after the human brain. What sets these algorithms apart is not just how fast they can analyze data, but the fact that they can learn and improve their results as they receive more information. Deep neural networks (DNNs) have already become integral to many modern industries, from banking and finance to defense, security, and health.
However, DNNs’ rise to prominence was tightly coupled to the available computational power, which made it possible to exploit their inherent parallelism. Consequently, deep learning managed to outperform all existing approaches in speech recognition and image classification, where the latter (AlexNet) increased the accuracy by a factor of two, sparking interest outside of the community and even academia.
The challenge in designing and testing machine learning models such as DNNs is that they are gigantic: they consist of millions of parameters that need to be trained. Furthermore, they need to be trained over enormous datasets to achieve good performance. With today’s technology, training large-scale models can take multiple days or weeks. Experts emphasise that traditional ML approaches are designed to address the dataset at hand, which implies central processing of data in a single database. However, this is usually not practical, because storing one monolithic dataset costs more than storing the same data in smaller parts. Likewise, the computational cost of mining a single data repository is higher than that of processing smaller parts of the data.
A promising solution to this problem is distributed training. In fact, companies like Google have been able to harness significantly larger datasets for problems like web search at sub-second timescales. Unfortunately, for machine learning tasks, the performance benefits of using multiple compute nodes currently do not scale well with the number of machines, and returns often diminish as more machines are added. As datasets increase in size and DNNs in complexity, the computational intensity and memory demands of deep learning increase proportionally. Training a DNN to competitive accuracy today essentially requires a high-performance computing cluster. To harness such systems, different aspects of training and inference (evaluation) of DNNs are modified to increase concurrency.
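One common way to increase concurrency in training is data parallelism, which the text above does not name explicitly, so the following is only a minimal sketch under that assumption. Each simulated worker computes the gradient of a mean-squared-error loss on its own equal-sized shard of the data, and the averaged result matches the gradient over the full batch. The function names and the toy dataset are hypothetical, and the workers here are a plain loop rather than separate machines.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (x, y) pair for a 1-D linear model y ~ w * x


def mse_gradient(w: float, batch: Sequence[Sample]) -> float:
    """Gradient of (1/n) * sum((w*x - y)^2) with respect to w."""
    n = len(batch)
    return sum(2.0 * (w * x - y) * x for x, y in batch) / n


def split_into_shards(data: Sequence[Sample], workers: int) -> List[Sequence[Sample]]:
    """Split the data into equal-sized contiguous shards, one per worker."""
    if len(data) % workers != 0:
        raise ValueError("this sketch assumes equal-sized shards")
    size = len(data) // workers
    return [data[i * size:(i + 1) * size] for i in range(workers)]


def data_parallel_gradient(w: float, data: Sequence[Sample], workers: int) -> float:
    """Average the per-shard gradients (stands in for a network reduction)."""
    local_gradients = [mse_gradient(w, shard) for shard in split_into_shards(data, workers)]
    return sum(local_gradients) / workers


def train(data: Sequence[Sample], workers: int, lr: float = 0.01, steps: int = 50) -> float:
    """Plain gradient descent using the averaged gradient at every step."""
    w = 0.0
    for _ in range(steps):
        w -= lr * data_parallel_gradient(w, data, workers)
    return w


# Hypothetical toy data generated from y = 3x.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
print(train(data, workers=4))
```

With equal-sized shards the average of the local gradients is the full-batch gradient, so the result does not depend on the number of workers. In a real cluster the averaging step is where communication cost enters.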
The rise of big data and IoT has also produced many distributed datasets, and gathering such large datasets in a central repository imposes huge processing and computing requirements. That is why researchers assert that distributed processing of data is the right computing platform. Distributed learning is also well suited to large-scale learning, where memory limitations and algorithmic complexity are the main obstacles. Besides overcoming the problem of centralised storage, distributed learning is scalable, since growth in data can be offset by adding more processors.
As opposed to a centralised approach, a distributed mining approach lends itself to parallel processing. Distributed learning algorithms also have their foundations in ensemble learning, which builds a set of classifiers to improve on the accuracy of a single classifier. The ensemble approach fits a distributed environment naturally, since each classifier can be trained on site, using the subset of data stored there.
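The ensemble idea described above can be sketched roughly as follows. This is an illustrative example, not a method taken from the text: each site trains a simple nearest-centroid classifier on its own data, and predictions are combined by majority vote. The classifier choice, the function names, and the three site datasets are all assumptions made for the sketch.

```python
from collections import Counter
from statistics import mean
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[float, str]  # (feature value, class label)


def train_local(samples: Sequence[Sample]) -> Dict[str, float]:
    """Train a nearest-centroid model: one mean feature value per class."""
    by_label: Dict[str, List[float]] = {}
    for value, label in samples:
        by_label.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_label.items()}


def predict_local(model: Dict[str, float], value: float) -> str:
    """Return the class whose centroid is closest to the value."""
    return min(model, key=lambda label: abs(model[label] - value))


def predict_ensemble(models: Sequence[Dict[str, float]], value: float) -> str:
    """Combine the site-local predictions by majority vote."""
    votes = Counter(predict_local(model, value) for model in models)
    return votes.most_common(1)[0][0]


# Hypothetical data held at three different sites; it never leaves the site.
site_a = [(1.0, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]
site_b = [(0.5, "low"), (3.0, "low"), (7.0, "high"), (10.0, "high")]
site_c = [(1.5, "low"), (2.5, "low"), (6.0, "high"), (9.5, "high")]

models = [train_local(site) for site in (site_a, site_b, site_c)]
print([predict_local(m, 4.9) for m in models], predict_ensemble(models, 4.9))
```

Only the small trained models need to be exchanged between sites, not the raw data. For a borderline input the individual site models can disagree, and the vote settles the answer.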
Parallel Computer Architecture
This section outlines the parallel hardware architectures that are used to execute learning problems in practice. They can be roughly classified into single-machine (often shared memory) and multi-machine (often distributed memory) systems.
Parallelism is ubiquitous in today’s computer architecture, internally on the chip in the form of pipelining and out-of-order execution as well as exposed to the programmer in the form of multi-core or multi-socket systems. Multi-core systems have a long tradition and can be programmed with either multiple processes (different memory domains), multiple threads (shared memory domains), or a mix of both. General-purpose CPUs have been optimized for general workloads ranging from event-driven desktop applications to datacenter server tasks (e.g., serving web-pages and executing complex business workflows).
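As a small illustration of the shared-memory thread model mentioned above, the sketch below has several threads write partial sums into one shared list. The function name and the chunking scheme are hypothetical. In standard CPython builds the global interpreter lock prevents this arithmetic from actually running faster, so the example shows the programming model only, not a speedup.

```python
import threading
from typing import List


def parallel_sum(values: List[int], num_threads: int) -> int:
    """Sum a list using threads that all write into one shared result list."""
    partial = [0] * num_threads  # shared memory: one slot per thread
    chunk = (len(values) + num_threads - 1) // num_threads  # ceiling division

    def worker(index: int) -> None:
        start = index * chunk
        # Each thread writes only to its own slot, so no lock is needed here.
        partial[index] = sum(values[start:start + chunk])

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # wait until every partial sum has been written
    return sum(partial)


print(parallel_sum(list(range(1, 101)), num_threads=4))
```

With multiple processes instead of threads, the `partial` list would not be visible across workers, and the partial sums would have to be sent back explicitly.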
Machine learning tasks are often compute intensive, making them similar to traditional high-performance computing (HPC) applications. Thus, large learning workloads perform very well on accelerated systems such as general-purpose graphics processing units (GPUs) or field-programmable gate arrays (FPGAs), which have been used in the HPC field for more than a decade. Those devices focus on compute throughput by specializing their architecture to exploit the high data parallelism in HPC workloads. However, even a single accelerated node is often not sufficient for the largest computational workloads. Since around 2015, distributed-memory architectures with accelerators such as GPUs have become the default option for machine learning at all scales.
Training large-scale models is a very compute-intensive task, so a single machine is often not capable of finishing it in the desired time frame. To accelerate the computation further, it can be distributed across multiple machines connected by a network. The most important metrics for the interconnection network (short: interconnect) are latency, bandwidth, and message rate. Different network technologies provide different performance. For example, both modern Ethernet and InfiniBand provide high bandwidth, but InfiniBand has significantly lower latencies and higher message rates. Special-purpose HPC interconnection networks can achieve higher performance in all three metrics. Yet, network communication remains generally slower than intra-machine communication.
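To see how latency and bandwidth interact, a widely used first-order model estimates the time to send one message as latency plus message size divided by bandwidth. The sketch below applies that model to two made-up networks. The latency and bandwidth figures are illustrative placeholders, not measurements of Ethernet or InfiniBand, and the model ignores message rate entirely.

```python
def transfer_time(message_bytes: float, latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """Estimated seconds to send one message: latency + size / bandwidth."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


# Hypothetical networks with the same bandwidth but different latency.
BANDWIDTH = 12.5e9          # bytes per second (assumed, equals 100 Gbit/s)
HIGH_LATENCY = 50e-6        # seconds (assumed)
LOW_LATENCY = 2e-6          # seconds (assumed)

for size in (1_000, 1_000_000_000):  # a 1 kB message and a 1 GB message
    slow = transfer_time(size, HIGH_LATENCY, BANDWIDTH)
    fast = transfer_time(size, LOW_LATENCY, BANDWIDTH)
    print(f"{size:>13} bytes: {slow:.6e} s vs {fast:.6e} s, ratio {slow / fast:.3f}")
```

Under these assumptions the low-latency network is many times faster for the small message, while for the large message the two are nearly identical because bandwidth dominates. This is one reason latency matters when training exchanges many small messages.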
Programming techniques to implement parallel learning algorithms on parallel computers depend on the target architecture. On single machines, they range from simple threaded implementations to OpenMP. Accelerators are usually programmed with special languages such as NVIDIA’s CUDA or OpenCL or, in the case of FPGAs, with hardware design languages. Yet, the details are often hidden behind library calls (e.g., cuDNN or MKL-DNN) that implement the time-consuming primitives.
On multiple machines with distributed memory, one can use simple communication mechanisms such as TCP/IP or Remote Direct Memory Access (RDMA), or more convenient libraries such as the Message Passing Interface (MPI) or Apache Spark. MPI is a low-level library focused on providing portable performance, while Spark is a higher-level framework that focuses more on programmer productivity. Since around 2016, the established MPI interface has become the de facto portable communication standard in distributed deep learning.
Distributed ML algorithms
Distributed ML algorithms are part of large-scale learning, which has received considerable attention over the last few years thanks to its ability to spread the learning process across several workstations, using distributed computing to scale up learning algorithms. These advances make ML tasks on big data scalable, flexible and efficient. How a distributed learning algorithm works depends on how the data is fragmented across sites. The two most common types of data fragmentation are:
- Horizontal fragmentation where subsets of instances are stored at different sites
- Vertical fragmentation where subsets of attributes of instances are stored at different sites
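The two kinds of fragmentation listed above can be illustrated with a small table of records. The sketch below is hypothetical: the patient records, attribute names, and function names are invented, and keeping a shared `id` key in every vertical fragment is an assumption made so that the pieces could later be rejoined.

```python
from typing import Dict, List, Sequence

Record = Dict[str, object]


def horizontal_fragments(records: Sequence[Record], num_sites: int) -> List[List[Record]]:
    """Give each site a subset of whole instances (round-robin by position)."""
    return [list(records[i::num_sites]) for i in range(num_sites)]


def vertical_fragments(
    records: Sequence[Record],
    attribute_groups: Sequence[Sequence[str]],
    key: str = "id",
) -> List[List[Record]]:
    """Give each site every instance, but only a subset of its attributes."""
    return [
        [{key: record[key], **{name: record[name] for name in group}} for record in records]
        for group in attribute_groups
    ]


# Hypothetical records; values are made up for illustration.
records = [
    {"id": 1, "age": 34, "bp": 120, "dx": "none"},
    {"id": 2, "age": 58, "bp": 145, "dx": "hypertension"},
    {"id": 3, "age": 41, "bp": 130, "dx": "none"},
    {"id": 4, "age": 67, "bp": 150, "dx": "hypertension"},
]

print(horizontal_fragments(records, num_sites=2))
print(vertical_fragments(records, attribute_groups=[["age"], ["bp", "dx"]]))
```

In the horizontal case every site sees complete records for some individuals; in the vertical case every site sees all individuals but only part of each record.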
Some of the most common scenarios in which distributed ML algorithms are deployed are healthcare and advertising, where even a simple application can accumulate a lot of data. Because the data is huge, programmers frequently re-train models on it and use parallel loading so as not to interrupt the workflow. For example, MapReduce was built to allow automatic parallelisation and distribution of large-scale special-purpose computations that process large amounts of raw data, such as crawled documents or web request logs, and compute various kinds of derived data.
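The MapReduce programming model mentioned above can be sketched in a few lines. This is a sequential, single-process illustration of the map, shuffle, and reduce phases, not the distributed system itself, which schedules those phases across many machines. The log format (`METHOD URL STATUS`) and the function names are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(log_line: str) -> Iterator[Tuple[str, int]]:
    """Emit (url, 1) for each well-formed request line 'METHOD URL STATUS'."""
    parts = log_line.split()
    if len(parts) == 3:  # silently skip malformed lines
        yield parts[1], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework would between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> int:
    """Combine all values for one key; here, count the requests per URL."""
    return sum(values)


def map_reduce(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return {key: reduce_phase(key, values) for key, values in shuffle(pairs).items()}


# Hypothetical web request log lines.
logs = ["GET /index.html 200", "GET /about.html 200", "POST /index.html 302", "malformed"]
print(map_reduce(logs))
```

Because each call to `map_phase` depends only on its own line and each call to `reduce_phase` only on its own key, both phases can be run in parallel on different machines without changing the result.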
Two of the most widely used distributed data processing systems for ML workloads are Apache Spark MLlib and Apache Mahout. Microsoft also released its Distributed ML Toolkit (DMTK), which contains both algorithmic and system innovations. The DMTK framework supports a unified interface for data parallelisation, a hybrid data structure for big model storage, model scheduling for big model training, and automatic pipelining for high training efficiency. System innovations and ML innovations are pushing the frontiers of distributed ML.
Unfortunately, writing and running a distributed ML algorithm is highly complicated, and developing distributed ML packages is difficult because of platform dependency. In addition, there are no standardised measures for evaluating distributed algorithms. Many ML researchers say that existing measures, which were benchmarked against classical ML methods, are less reliable in this setting.
Techniques applied in distributed deep learning are converging to the point where a standard programming interface (or framework) can be designed. In the future, ecosystems such as Ease.ml may make the definition of a training scheme (e.g., with respect to centralization and gradient consistency) easier, hiding most of the low-level infrastructure setup. Combining the increasing support for cloud systems and elastic training (where nodes can be spun up and removed at will) with the latest developments in evolutionary algorithms, we may see adaptive and financially viable optimization methods rising to prominence.
USC Viterbi Researchers Win $2 Million DARPA Grant for Distributed Deep Model Training
Now, three Ming Hsieh Department of Electrical and Computer Engineering professors have won a 4-year, $2 million DARPA grant to address the problem. Salman Avestimehr, principal investigator, along with Professors Murali Annavaram and Mahdi Soltanolkotabi will be working on new algorithmic solutions to better test and train machine learning models for artificial intelligence. The overarching goal of this project, named DIAMOND (Distributed Training of Massive Models at Bandwidth Frontiers), is to leverage algorithmic, hardware, and system implementation innovations to break this barrier in distributed deep model training and enable two orders of magnitude speedup. These speedups are only possible because the team will be exploring radically new hardware, models, and approaches to DNN training.
The importance of improving the speed at which DNNs can be trained and tested cannot be overstated. Consider this hypothetical: a company has developed a technology which can be deployed at airports. It can scan people’s faces as they arrive and identify those who may be infected with coronavirus. For a technology like this to work, a DNN is needed. And this brings us back to that design, test, analyze, repeat pattern. With current systems, it may take engineers weeks or months to complete the process of designing, testing and redesigning the DNN. By the time the network had been trained to handle that data and think on its own, it would be too late. In fact, motivated by the above example, the USC team also plans to develop an application scenario, named ADROIT (Anomaly Detector using Relationships over Image Traces), to demonstrate the societal impact of their innovations.
Engineering has always been, in part, about innovation. Something is made, tested, and improved upon. Without innovation, there can be no true progress – for engineers or the society in which they operate. This is not lost on Avestimehr. “Innovation can only happen as fast as you can test and analyze your design – this process is the fuel for A.I. data science,” he says. Yes, the work of these researchers could speed up the rate at which we train and test DNNs, bring down the cost of training significantly, and ultimately bring technology to society faster. But there is something more important at stake as well: Ensuring the continued ability to innovate new machine learning models and the Artificial Intelligence that relies on them.