
Decoupled Neural Interfaces Using Synthetic Gradients


This graph shows an RNN trained on next-character prediction on Penn Treebank, a language modelling problem. The y-axis gives bits-per-character (BPC), where smaller is better. The x-axis is the number of characters seen by the model as training progresses. The dotted blue, red and grey lines are RNNs trained with truncated BPTT, unrolled for 8, 20 and 40 steps respectively: the more steps the RNN is unrolled before performing backpropagation through time, the better the model, but the slower it trains. When DNI is used on the RNN unrolled 8 steps (solid blue line), the RNN is able to capture the long-term dependencies of the 40-step model, but is trained twice as fast (both in terms of data and wall-clock time on a regular desktop machine with a single GPU).
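For readers unfamiliar with the baseline in this figure, here is a minimal sketch of truncated BPTT over a fixed unroll length (assumed PyTorch; the vocabulary size, hidden size, optimiser and the name UNROLL are illustrative, not the experimental setup): gradients only flow through the last UNROLL steps, which is why shorter unrolls train faster but capture shorter temporal dependencies.

```python
import torch
import torch.nn as nn

UNROLL = 8                                   # 8, 20 or 40 steps in the figure above
rnn = nn.RNN(input_size=50, hidden_size=128, batch_first=True)
readout = nn.Linear(128, 50)                 # next-character logits
opt = torch.optim.Adam(list(rnn.parameters()) + list(readout.parameters()), lr=1e-3)

seq = torch.randint(0, 50, (1, 1000))        # a long stream of character ids (dummy data)
onehot = nn.functional.one_hot(seq, 50).float()

h = None
for start in range(0, seq.size(1) - UNROLL - 1, UNROLL):
    x = onehot[:, start:start + UNROLL]      # window of inputs
    y = seq[:, start + 1:start + UNROLL + 1] # next characters as targets
    out, h = rnn(x, h)
    loss = nn.functional.cross_entropy(readout(out).transpose(1, 2), y)
    opt.zero_grad()
    loss.backward()                          # BPTT runs only through the last UNROLL steps
    opt.step()
    h = h.detach()                           # cut the graph: no gradient flows further back
```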

To reiterate, adding synthetic gradient models allows us to decouple the updates between two parts of a network. DNI can also be applied to hierarchical RNN models, i.e. a system of two (or more) RNNs running at different timescales. As we show in the paper, DNI significantly improves the training speed of these models by allowing the higher-level modules to be updated more frequently.

Hopefully, from the explanations in this post and a brief look at some of the experiments we report in the paper, it is evident that it is possible to create decoupled neural interfaces. This is done by creating a synthetic gradient model which takes in local information and predicts what the error gradient will be. At a high level, this can be thought of as a communication protocol between two modules. One module sends a message (its current activations); another receives the message and evaluates it using a model of utility (the synthetic gradient model). The model of utility allows the receiver to provide instant feedback (a synthetic gradient) to the sender, rather than having to wait for the evaluation of the true utility of the message (via backpropagation). This framework can also be thought of from an error-critic point of view [Werbos] and is similar in flavour to using a critic in reinforcement learning [Baxter].
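To make this protocol concrete, here is a minimal sketch in PyTorch (the layer sizes, optimisers and names such as sg_model are illustrative assumptions, not the paper's implementation): the sender is updated immediately from the predicted gradient, while the synthetic gradient model is regressed towards the true gradient once the receiver has computed it.

```python
import torch
import torch.nn as nn

module_a = nn.Linear(32, 64)   # "sender": produces the message (activations)
module_b = nn.Linear(64, 10)   # "receiver": the rest of the network
sg_model = nn.Linear(64, 64)   # model of utility: predicts dLoss/d(activations)
opt_a = torch.optim.SGD(module_a.parameters(), lr=0.1)
opt_b = torch.optim.SGD(list(module_b.parameters()) + list(sg_model.parameters()), lr=0.1)

x = torch.randn(16, 32)
y = torch.randint(0, 10, (16,))

# 1) Sender update: instant feedback from the synthetic gradient,
#    without waiting for the receiver's backward pass.
h = module_a(x)
synthetic_grad = sg_model(h).detach()
opt_a.zero_grad()
h.backward(synthetic_grad)     # backpropagate the *predicted* gradient into module A
opt_a.step()

# 2) Receiver side (can run later or in parallel): compute the true gradient
#    and regress the synthetic gradient model towards it.
h_msg = module_a(x).detach().requires_grad_(True)  # recomputed only to keep the sketch self-contained
loss = nn.functional.cross_entropy(module_b(h_msg), y)
opt_b.zero_grad()
loss.backward()                # fills h_msg.grad with the true gradient of the loss
sg_loss = nn.functional.mse_loss(sg_model(h_msg.detach()), h_msg.grad.detach())
sg_loss.backward()             # train the synthetic gradient model against the true gradient
opt_b.step()
```

The two phases run sequentially here only for readability; the point of the interface is that the true gradient can arrive asynchronously, so the sender keeps processing new inputs while the receiver catches up.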

These decoupled neural interfaces allow distributed training of networks, extend the temporal dependencies that RNNs can learn, and speed up hierarchical RNN systems. We’re excited to explore what the future holds for DNI, as we think it will be an important basis for opening up more modular, decoupled, and asynchronous model architectures. Finally, there are lots more details, tricks, and full experiments which you can find in the paper here.

Source: https://deepmind.com/blog/article/decoupled-neural-networks-using-synthetic-gradients
