Next lab meeting: Mar 28, 2018
Candidate papers
Notes from past discussions
 Primary contribution: trained iteratively, progressively increasing the resolution as each new layer is added.
 Avoids "shocking" the system by slowly fading in new layers via an interpolation knob on a skip connection.
 They have some mechanism for dynamically scaling the learning rate or weights, based on a He 2015 paper.
 Awesome video and some nice pics to go with the paper.
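The fade-in trick above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `old_path`/`new_path` are hypothetical stand-ins for the existing low-resolution pathway and the newly added layer, and `alpha` is the knob that ramps from 0 to 1 during training.

```python
import numpy as np

def faded_output(old_path, new_path, x, alpha):
    """Blend the old pathway with the newly added layer.

    alpha ramps from 0 to 1 during training, so the new layer is
    faded in gradually instead of "shocking" the network.
    (old_path/new_path are hypothetical stand-ins for real layers.)
    """
    return (1.0 - alpha) * old_path(x) + alpha * new_path(x)

# Toy example: old path is the identity, new path doubles the signal.
x = np.ones(4)
out = faded_output(lambda v: v, lambda v: 2.0 * v, x, alpha=0.25)
# 0.75 * 1 + 0.25 * 2 = 1.25 everywhere
```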
 This trains a NN to model a probability distribution.
 (It seems a lot like jitter, but somewhat more intelligent. The uncertainty is built in.)
 Uses two terms: one makes it fit the target values, one makes it not ignore the random inputs.
 Impressive that they published in NIPS without a lot of validation.
 Strong theoretical basis. The whole paper is described by just a few equations.
 Kind of the opposite of a GAN. GAN samples a distribution. This models a distribution.
 Used reinforcement learning to predict NN architectures.
 Not clear why reinforcement learning is important. Could possibly use function approximation to accomplish the same goal.
 Does beat a random search for topologies.
 Works really well for certain datasets.
 Not efficient, general, or reproducible.
 Not clear how they encode each word. We talked about a potential encoding made from applying dimensionality reduction to WordNet distances.
 They used residual connections to make their network deeper.
 Bidirectional LSTM on the first layer only. We discussed applying it to more layers.
 We couldn't figure out the attention module.
 We considered using a generative approach with a similar model.
 We think their training set is divided into sentences and tokenized.
 Summary: Normalize input to layers, and you get training with fewer epochs! (Enables larger learning rate, and more robust to initialization, and it implicitly regularizes.)
 Internal Covariate Shift is a problem where the inputs to a layer have changing distributions.
 Normalize the input to every layer and we can fix this problem: subtract the mean, divide by the standard deviation, then apply a learned scale and shift.
 Minibatches are used to estimate the mean and variance during training.
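The normalize-then-scale-and-shift step above can be written out directly. This is a minimal sketch of the per-feature training-time computation (gamma and beta are the learned scale and shift; the running statistics used at inference time are omitted):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize a minibatch per feature, then scale and shift.

    x: (batch, features). gamma/beta are the learned scale and shift.
    eps guards against division by zero for low-variance features.
    """
    mean = x.mean(axis=0)                      # per-feature batch mean
    var = x.var(axis=0)                        # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)    # zero mean, unit variance
    return gamma * x_hat + beta

x = np.array([[1.0, 10.0], [3.0, 30.0]])
y = batch_norm(x, gamma=np.ones(2), beta=np.zeros(2))
# Each column of y now has (approximately) zero mean and unit variance.
```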
 Summary: Train the network to learn a residual function, and you can build very deep, powerful neural networks.
 Insight 1: The shortcut connection on page 2 is how they add the residual back in.
 Insight 2: Probably don't want to do this on the input layer. That layer needs to sort of solve the problem a bit before we can begin stacking.
 Insight 3: Error travels along the shortcut connections back through the network, mostly eliminating the vanishing gradient problem.
 Insight 4: 1x1 convolutions are equivalent to a fully connected layer applied across channels at each spatial position.
 Contribution: A simple residual block that can be used to build deep neural networks, based on a shortcut or skip connection.
 Contribution 2: Their networks performed so much better that they won several image-recognition competitions.
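The residual block with its shortcut connection can be sketched as follows. This is a toy dense-layer version, not the paper's convolutional block: F(x) is a small two-layer residual function, and the shortcut simply adds x back in before the final activation.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def residual_block(x, w1, w2):
    """y = relu(F(x) + x): the shortcut adds the input back onto the
    residual function F, so the block only has to learn a correction."""
    f = relu(x @ w1) @ w2        # two-layer residual function F(x)
    return relu(f + x)           # shortcut (skip) connection

# With F's weights at zero, the block reduces to the identity (for x >= 0),
# which is why stacking many such blocks does not degrade the signal.
x = np.array([1.0, 2.0, 3.0])
y = residual_block(x, np.zeros((3, 3)), np.zeros((3, 3)))
```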
 Summary: Super-resolving (upscaling) images with generative adversarial networks.
 Insight 1: Perhaps we could use generative adversarial networks.
 Contribution: Used a different loss function in combination with generative adversarial networks; the loss is no longer simple mean-squared error.
 Summary: This paper presents an autoencoder using recurrent connections and convolution to compress images.
 Insight 1: Recurrent connections and Convolution can give a lot of power to a neural network.
 Insight 2: Gated Recurrent Units are interesting, and have fewer weights than LSTMs.
 Insight 3: Does as well as JPEG with an autoencoder.
 Insight 4: Multiplication in a neural network might be very important, or provide a lot of extra information.
 Contribution: Works on any size image due to the image patches, and has recurrent connections between those patches.
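The GRU-vs-LSTM weight claim above can be checked with a quick parameter count. This assumes the standard cell definitions (an LSTM has 4 gate-like transforms, a GRU has 3, each with a weight matrix over the concatenated input and hidden state plus a bias); the exact dimensions here are made up for illustration.

```python
def rnn_params(n_gates, input_dim, hidden_dim):
    """Weights + biases for a gated RNN cell: each gate has a weight
    matrix over [input; hidden] plus a bias vector."""
    return n_gates * (hidden_dim * (input_dim + hidden_dim) + hidden_dim)

lstm = rnn_params(4, input_dim=64, hidden_dim=128)  # 4 gates: i, f, o, g
gru = rnn_params(3, input_dim=64, hidden_dim=128)   # 3 gates: z, r, h~
# gru / lstm == 0.75: a GRU needs 25% fewer weights than an LSTM
# of the same size.
```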
 Summary: This paper presents a deep neural network that can generate natural sounding audio.
 Insight 1: We believe that in figure 3, the bottom input is time.
 Insight 2: The Gated Activation Unit essentially chooses how much of x to allow through a gate. We also believe the added value of these GAUs comes from their use of multiplication, which may give the network extra power. The paper that introduced these GAUs makes similar claims.
 Results: This is a great paper, and an interesting approach with a lot of ideas we could use. There are some strange things about it, however.
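The gating-by-multiplication idea above can be sketched directly. This is a simplified dense version of the gated activation z = tanh(W_f x) * sigmoid(W_g x); the real unit operates on dilated convolution outputs, and the weight matrices here are placeholders.

```python
import numpy as np

def gated_activation(x, w_f, w_g):
    """Gated activation: z = tanh(W_f x) * sigmoid(W_g x).

    The sigmoid branch acts as a gate in [0, 1] choosing how much of
    the tanh branch to let through; the elementwise multiplication is
    the part we suspect gives the network extra power.
    """
    gate = 1.0 / (1.0 + np.exp(-(w_g @ x)))   # sigmoid gate in [0, 1]
    return np.tanh(w_f @ x) * gate

# With identity weights the unit reduces to tanh(x) * sigmoid(x).
x = np.array([0.5, -1.0])
z = gated_activation(x, np.eye(2), np.eye(2))
```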
 Summary: Paper does a lot of math to go from Inverse Reinforcement Learning to Generative Adversarial Networks. They use a strange gradient to train the networks, and networks of only two layers with 100 units each.
 Insight 1: "We first generated expert behavior for these tasks by running TRPO on these true cost functions to create expert policies." This essentially means their "expert" was generated by another algorithm (TRPO).
 Results: The results appear good, but given Insight 1 they might be more of a straw-man comparison.
 Summary: Did an autoencoder and transition model approach to learn how to control a "robotic arm".
 This paper starts with a lot of literature review covering reinforcement learning, and more; and is pretty thorough.
 Problem Set Up: Minimize the cost according to some function.
 Insight 1: They use the last k timesteps as input to the transition function (which they call the prediction model).
 Equation 3a: Feed the observations into the encoder to get the intrinsic beliefs (z).
 Equation 4: The prediction error is the difference between the observation at the next time step and what you predicted.
 Equation 5: Essentially minimize the sum squared error. Interesting notation as they were able to pull three equations out of it.
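The encode-predict-compare loop in Equations 3a-5 can be sketched in spirit as below. This is only an illustration of the structure: the linear `encoder` and `transition` here are hypothetical stand-ins for the paper's neural networks, and the comparison is done in belief space rather than the paper's exact formulation.

```python
import numpy as np

# Hypothetical linear stand-ins for the paper's encoder and
# transition (prediction) model; the real ones are neural networks.
def encoder(obs, w_enc):
    return w_enc @ obs            # Eq. 3a: observation -> belief z

def transition(z, w_trans):
    return w_trans @ z            # predict the next belief

def prediction_loss(obs_t, obs_t1, w_enc, w_trans):
    """Eqs. 4-5 in spirit: encode, predict forward one step, and take
    the sum squared error against the encoding of the next observation."""
    z_pred = transition(encoder(obs_t, w_enc), w_trans)
    z_next = encoder(obs_t1, w_enc)
    err = z_next - z_pred         # Eq. 4: prediction error
    return np.sum(err ** 2)       # Eq. 5: sum squared error

# A static world with identity models predicts perfectly: loss is zero.
loss = prediction_loss(np.ones(2), np.ones(2), np.eye(2), np.eye(2))
```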
 Insight 2: Simultaneously train the transition and observation models together.
 Very similar to Stephen and Mike's robot arm work.
 Insight 3: Regularization seems to help; it can help the model settle into a simpler representation.
 Results: Only tested on very simple simulations including a pendulum experiment.
 Summary: Did a deep Q-network with continuous actions to control simulated robots.
 This paper starts with a lot of general equations describing the Bellman equation and Q-learning.
 Insight 1: (Equation 6) backpropagate from the critic to train the actor. (The paper admits that this was not really a novel contribution.)
 Insight 2: Avoid divergence by using a copy of the actor and critic that lags or slowly tracks the updated ones.
 Insight 3: Use batch normalization.
 Insight 4: Do exploration by adding noise to what the actor said to do.
 We decided that insight 2 is probably the main contribution of this paper.
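The lagging-copy trick in Insight 2 can be sketched as a soft target update. This is a minimal illustration (the standard update theta' <- tau*theta + (1 - tau)*theta'), applied here to a plain weight vector rather than full actor/critic networks; `tau` is a small constant so the target only slowly tracks the online weights.

```python
import numpy as np

def soft_update(target_w, online_w, tau=0.001):
    """Insight 2: the target (lagging) copy of the actor/critic slowly
    tracks the online weights: theta' <- tau*theta + (1 - tau)*theta'."""
    return tau * online_w + (1.0 - tau) * target_w

target = np.zeros(3)
online = np.ones(3)
for _ in range(1000):
    target = soft_update(target, online, tau=0.01)
# After many steps the target has crept most of the way toward the
# online weights, but it never jumps, which is what avoids divergence.
```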
