Welcome back. In this lecture, we will look at multi-node distributed training, which allows us to efficiently parallelize deep networks across multiple servers in order to minimize the time to train. Specifically, we will look at scaling challenges, data and model parallelism, AllReduce approaches, and synchronous and asynchronous multi-node training and their trade-offs.

Before we do, let's review gradient descent and stochastic gradient descent. When using gradient descent to find a local minimum, we take steps proportional to the negative of the gradient of the cost with respect to the weights. However, there are three main issues with gradient descent. First, in gradient descent, you have to run through all of the samples in your training set to do a single update for a parameter in a particular iteration. Second, gradient descent gets stuck at saddle points, which are very common in high-dimensional spaces; AlexNet, for example, has 60 million dimensions, making saddle points a significant issue. Finally, gradient descent is thought to converge to sharp minima more frequently. This is problematic, as even slight variations between the training and test objective functions can cause dramatic shifts in their values when sitting in a sharp minimum.

Stochastic gradient descent, on the other hand, solves many of these issues by breaking the training data set into minibatches, each of which is then used to compute a parameter update. This algorithm is more exploratory than gradient descent, converges to flatter minima, does not get caught in saddle points as easily, and is much more computationally efficient. As seen here, in gradient descent, to make a weight update one must go through the entire training data set, which can be very slow. Stochastic gradient descent, in contrast, makes an update after each minibatch, which has a dramatic effect on efficiency. For these reasons, nearly all, if not all, deep learning models are trained with some variant of stochastic gradient descent.

We'll now transition to discussing scaling challenges. When using SGD, choosing the right batch size is very important. With smaller batch sizes, one does not efficiently utilize the computational resources, yet with larger batch sizes, one can run into issues similar to those we explored with gradient descent: very slow progress near saddle-like points, and getting stuck in sharp minima. With very deep networks trained on large data sets, efficiently parallelizing the network across multiple servers becomes essential in order to minimize the time to train. To this end, we will explore two approaches: data parallelism and model parallelism.

In model parallelism, we split the model weights among the nodes, with each node working on the same minibatch. With data parallelism, we use the same model on every node, but feed each node different parcels of the data. Such a method is better for networks with few weights, like GoogLeNet, and is currently used in Intel Optimized Caffe. Visually, we can see that with data parallelism, the algorithm distributes the data between the various nodes, and each node independently tries to estimate the same parameters. The nodes then exchange their estimates for each layer, using a parameter server or an AllReduce method, as we will discuss, to come up with the right estimate for that step. While the minibatch is distributed over many nodes, three in this example, one cannot simply increase the batch size by three times, as the time to train increases with larger minibatches, due to issues similar to those of gradient descent.
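To make the difference between the two update rules concrete, here is a minimal sketch in Python. It is not from the lecture: the toy squared-error linear model, the gradient function, and the learning-rate and batch-size values are placeholder assumptions, chosen only to show where each scheme applies its weight updates.

```python
# Minimal sketch: full-batch gradient descent vs. minibatch SGD.
# The squared-error linear model below is a placeholder for illustration only.
import numpy as np

def gradient(w, x_batch, y_batch):
    # Mean gradient of 0.5 * ||x_batch @ w - y_batch||^2 over the batch (toy model).
    return x_batch.T @ (x_batch @ w - y_batch) / len(x_batch)

def full_batch_gd(w, X, Y, lr=0.01, epochs=10):
    # Gradient descent: one weight update per pass over the entire training set.
    for _ in range(epochs):
        w = w - lr * gradient(w, X, Y)
    return w

def minibatch_sgd(w, X, Y, lr=0.01, epochs=10, batch_size=32):
    # Stochastic gradient descent: one weight update per minibatch,
    # so many (noisier) updates per epoch.
    n = len(X)
    for _ in range(epochs):
        order = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            w = w - lr * gradient(w, X[batch], Y[batch])
    return w

# Example usage on random data.
X, Y = np.random.randn(1024, 10), np.random.randn(1024)
w_gd = full_batch_gd(np.zeros(10), X, Y)
w_sgd = minibatch_sgd(np.zeros(10), X, Y)
```

The noise that comes from sampling minibatches is what makes SGD more exploratory than full-batch gradient descent, as described above.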
With model parallelism, the algorithm sends the same data to all nodes, but each node is responsible for estimating different parameters. These nodes then exchange their estimates with each other to come up with the right estimates for all the parameters. Data parallelism is preferred when the number of weights is small, which is true for linear topologies.

When updating weights on a single node using stochastic gradient descent, we pass in x training examples, where x is the batch size, and forward-propagate them through the network. After computing the cost over all examples, we then compute the data gradients from layer to layer and update the model weights using the weight gradients. When we have multiple nodes, 32 in this example, and are using data parallelism, we partition the minibatch into 32 subsections and distribute them to the 32 workers. When back-propagating, each worker computes a sum of the weight gradients for its subset of the batch. The weight gradients are then summed across workers, producing a numerically identical result to what one would find on a single node training with the large batch size.

Deep learning practitioners have demonstrated scaling across various numbers of nodes. Baidu distributed training to 40 GPU nodes; later that year, UC Berkeley scaled training to 120 GPU nodes, and their paper provided sufficient details for other practitioners to build upon their work. A few months later, Intel demonstrated scaling to 128 CPUs, Google to 120 GPUs, and Amazon to 120 GPUs. Most recently, and not shown in this slide, Facebook demonstrated near-linear scaling to 256 GPUs, reducing the time to train from several days to just one hour.

With very large batch sizes, the time to train becomes quite large, making training slow and possibly unable to reach the same accuracy. So let's assume that we have a batch size of 1024; how can we distribute the data across nodes? One option is to have 1024 nodes, each with a batch size of 1. However, with this arrangement, the communication between the nodes becomes a bottleneck, and the computation on each node is too little. On the other hand, using 16 nodes, each with a batch size of 64, is more reasonable, as most of the communication is hidden by the computation.

Multi-node training in IntelCaffe, which uses data parallelism, works in the following manner. First, the data on a given node is forward-propagated through the network, which in this case is composed of two layers, L1 and L2. Then the L2 gradients are sent to the parameter server after that layer has been back-propagated through, and similarly, the L1 gradients are sent to the server after L1 has been back-propagated through. When the server receives the L2 gradients from all nodes, it applies an update and broadcasts it to all the nodes, and likewise with the L1 gradients. Nodes wait for these updates before forward-propagating through the updated network.

Now that we have discussed how data and model parallelism work, we will consider strategies for implementing gradient aggregation: parameter servers, reduction trees, rings, and butterflies. One strategy for communicating gradients is to appoint one node as the parameter server, which computes the sum of the communicated gradients and sends the updates to each of the workers. However, there is a bottleneck in sending and receiving all of the gradients with just one parameter server.
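As a rough illustration of the data-parallel flow just described, here is a single-process sketch of one synchronous update with parameter-server-style aggregation. The function names, the toy gradient, and the worker split are assumptions made for illustration and do not correspond to IntelCaffe's actual API.

```python
# Single-process sketch of one data-parallel SGD step with parameter-server-style
# aggregation. The toy linear-model gradient is a placeholder for illustration.
import numpy as np

def worker_gradient(w, x_shard, y_shard):
    # Each worker back-propagates on its own shard and returns the *sum* of the
    # per-example gradients (toy squared-error linear model).
    return x_shard.T @ (x_shard @ w - y_shard)

def data_parallel_step(w, X, Y, num_workers=16, lr=0.01):
    # Partition the global minibatch into one shard per worker.
    x_shards = np.array_split(X, num_workers)
    y_shards = np.array_split(Y, num_workers)

    # The "parameter server" sums the per-worker gradient sums. This is
    # numerically identical to the gradient of the whole minibatch computed
    # on a single node.
    total_grad = sum(worker_gradient(w, xs, ys)
                     for xs, ys in zip(x_shards, y_shards))

    # Apply one update and broadcast the new weights back to every worker.
    return w - lr * total_grad / len(X)

# Example usage: a global minibatch of 1024 examples split across 16 workers,
# i.e. a per-worker batch size of 64, as in the lecture's example.
X, Y = np.random.randn(1024, 10), np.random.randn(1024)
w = data_parallel_step(np.zeros(10), X, Y)
```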
Another strategy is an AllReduce tree. AllReduce is a communication pattern in which each worker produces one or more data values that must be globally reduced, generally with a commutative, binary, element-wise operator, to produce a single result value, and this single value must then be broadcast to all workers before they can continue. In an AllReduce tree, the local gradient information is reduced across the entire network using a tree-based algorithm, and the result is then broadcast to each individual node. In this AllReduce tree example, there are seven nodes, each holding one of the gradient values 1 through 7, and the goal is to sum all of these values. In the first step, 1, 2, and 5 are summed to 8, and 3, 4, and 6 are summed to 13. In the next step, these results, 8 and 13, are summed with 7 to make 28. Finally, 28 is broadcast to the rest of the network in just two steps. In total, if N is the number of nodes, the time is a function of log N, which makes this approach ideal when the number of nodes is a power of two minus one.

The AllReduce butterfly requires the fewest steps, as it does not require separate reduce and broadcast phases. In this example, there are four nodes, and starting from the top we can see how the communication happens at each step. As shown in step one, nodes 1 and 2 are summed concurrently with nodes 3 and 4, resulting in the first two nodes holding the value 3 and the next two nodes holding the value 7. In step two, the nodes holding 3 and 7 communicate, resulting in all nodes holding the value 10. The complexity of this algorithm is also log base 2 of N, but it requires half as many steps as the AllReduce tree.

The ring AllReduce algorithm is useful because the communication cost is constant and independent of the number of nodes in the system; it is determined solely by the slowest connection. Each node only ever sends data to its right neighbor and receives data from its left neighbor. The algorithm proceeds in two phases: first a scatter-reduce to exchange and sum the data, and then an all-gather to communicate the final result (a minimal code sketch of these two phases appears at the end of this transcript).

While we have looked at synchronous multi-node training, we will now briefly discuss asynchronous multi-node training and its trade-offs. In order to accelerate the convergence of SGD and improve the training speed, asynchronous parallelization of stochastic gradient descent has been investigated. While asynchronous SGD overcomes communication delays, it comes with a myriad of problems. The algorithm requires more tuning of hyperparameters such as momentum and learning rate, and requires more epochs to train. Furthermore, it does not match single-node performance, it is quite hard to debug, and it has not been shown to scale and retain accuracy on large models. Therefore, while there are benefits to asynchronous SGD, many issues have not yet been fully addressed.

In this lecture, we have covered multi-node distributed training, which is very useful for training deep networks on large data sets. To that end, we discussed scaling challenges, data parallelism and model parallelism, AllReduce algorithms, and asynchronous stochastic gradient descent. During the next lecture, we will look into some hot topics in deep learning research. [MUSIC]
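For reference, below is the small single-process simulation of the ring AllReduce's two phases mentioned above. The four-node setup, the chunk bookkeeping, and the toy gradient vectors are assumptions made for illustration; a real system performs actual point-to-point transfers between machines and overlaps communication with computation.

```python
# Single-process simulation of ring AllReduce: scatter-reduce, then all-gather.
# The 4-node setup and toy gradient vectors are placeholders for illustration.
import numpy as np

def ring_allreduce(per_node_grads):
    """per_node_grads: list of equal-length 1-D arrays, one per simulated node."""
    n = len(per_node_grads)
    # Each simulated node splits its gradient vector into n chunks.
    chunks = [list(np.array_split(g.astype(float), n)) for g in per_node_grads]

    # Phase 1: scatter-reduce. In step t, node i sends chunk (i - t) mod n to its
    # right neighbor, which adds it into its own copy. After n - 1 steps, node i
    # holds the fully summed chunk (i + 1) mod n.
    for t in range(n - 1):
        for i in range(n):
            c = (i - t) % n
            chunks[(i + 1) % n][c] = chunks[(i + 1) % n][c] + chunks[i][c]

    # Phase 2: all-gather. In step t, node i passes the finished chunk
    # (i + 1 - t) mod n to its right neighbor, which overwrites its own copy.
    for t in range(n - 1):
        for i in range(n):
            c = (i + 1 - t) % n
            chunks[(i + 1) % n][c] = chunks[i][c].copy()

    # Every node now holds the full sum; reassemble one copy per node.
    return [np.concatenate(node_chunks) for node_chunks in chunks]

# Example: 4 simulated nodes, each holding an 8-element "gradient".
grads = [np.arange(8, dtype=float) + 10 * k for k in range(4)]
reduced = ring_allreduce(grads)
assert all(np.allclose(r, sum(grads)) for r in reduced)
```

Note that each simulated node only ever sends data to its right neighbor and receives from its left neighbor, and the amount of data each node transfers per phase is roughly the size of its gradient regardless of the number of nodes, which is why the ring's bandwidth cost is essentially independent of node count, as described above.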