Ideas that someone is working on
 Can we make deep networks incremental by training multiple networks in parallel, each with a different depth, and passing the learning laterally among them?
 Since convolutional layers are a special case of fully-connected layers, what if we just dampen the learning rate on the non-intersecting weights? Would this give us the power of convolutional layers with the flexibility of fully-connected layers?
 A study is needed to determine how much sparsity in connected layers is optimal.
 Does our Fourier nets algorithm work better when multidimensional series data is used for training?
Ideas that may (or may not) have publication potential
 Can we make a layer that does cosine-similarity or Pearson's correlation instead of dot product with the weights to enable a network that naturally supports sparse input vectors?
 A paper is needed contrasting the capabilities and strengths of full autoencoders versus inferring intrinsic vectors from just a decoder.
 In neural net training, initialization is important because the gradient depends on the initial weights, and the weights only learn as far as the gradient is intelligent. It seems a shame that there is only one shot to get good initial values. The following algorithm attempts to fix that:
 1. Let d = 0.1.
 2. Let m be a vector of size equal to the number of weights.
 3. Fill m with values from a Gaussian distribution with deviation d.
 4. Add m to the weights.
 5. Use backpropagation to update all the weights.
 6. Subtract m from all the weights.
 7. Decay d a little bit.
 8. If not yet converged, go to step 3.
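 The steps above can be sketched roughly as follows. This is only a minimal illustration under several assumptions that are not in the notes: "deviation" is read as standard deviation, "backpropagation" is stood in for by a plain gradient step on a user-supplied gradient function, the convergence test is replaced by a fixed number of steps, and the names `train_with_decaying_jitter`, `grad_fn`, `lr`, and `decay` are hypothetical.

```python
import numpy as np

def train_with_decaying_jitter(w0, grad_fn, lr=0.1, d=0.1, decay=0.99, steps=500, seed=0):
    """Gradient descent where each gradient is evaluated at a temporarily jittered point.

    Assumptions (not from the notes): fixed step count instead of a convergence
    test, and a plain gradient step in place of full backpropagation.
    """
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)             # copy so the caller's array is untouched
    for _ in range(steps):
        m = rng.normal(0.0, d, size=w.shape)  # fill m with Gaussian noise of deviation d
        w += m                                # add m to the weights
        w -= lr * grad_fn(w)                  # update the weights from the jittered point
        w -= m                                # subtract m from the weights
        d *= decay                            # decay d a little bit
    return w

# Hypothetical usage: noiseless least-squares regression on synthetic data.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w

def mse_grad(w):
    # Gradient of the mean squared error with respect to w.
    return 2.0 * X.T @ (X @ w - y) / len(y)

w_fit = train_with_decaying_jitter(np.zeros(3), mse_grad)
print(w_fit)
```

 On a convex toy problem like this one the jitter only adds noise that fades as d decays; whether it helps escape poor initializations in a real network is exactly the open question.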
 How many latent degrees of freedom can we really model with visual recognition problems? What stops us from pushing this further?
 Would autoencoders work better if we started with only a single point in the dataset, and gradually added more points? What if we start at a medoid point, then add its nearest neighbors, in a breadth-first manner? Should we add more units as the training data gets larger, since existing units will already specialize on the earlier data? What kind of schedule is best for this process?
 Takens' theorem http://www.youtube.com/watch?v=6i57udsPKms seems to imply that when modeling time-series data, it might be better to augment the dimensionality with several time-shifted variants of each existing dimension. Does this work in practice? If so, then perhaps it would be better to design a learning algorithm that jitters time while learning (while utilizing its knowledge of how much it jittered each instance).
 For modeling dynamical systems, if we augment each data point with its trajectory (and possibly subsequent derivatives), then nearest-neighbors will be less likely to shortcut-join with points travelling in a different direction. If we then use CycleCut, we should arrive at a manifold that can be unfolded. Would this lead to better models of dynamical systems?
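 For the time-shift idea in the Takens' theorem item above, a delay embedding can be sketched as below. This is a minimal illustration, not a tested recipe: the function name `delay_embed` and the choice of lags are hypothetical, and how many shifts to use (and how far apart) is left open by the notes.

```python
import numpy as np

def delay_embed(series, lags):
    """Augment each dimension of a series with time-shifted copies of itself.

    series: array of shape (T,) or (T, D).
    lags:   non-negative integer shifts, e.g. (0, 1, 2).
    Returns an array of shape (T - max(lags), D * len(lags)); each output row
    holds the values observed `lag` steps earlier, for every lag in order.
    """
    x = np.asarray(series, dtype=float)
    if x.ndim == 1:
        x = x[:, None]                      # treat a 1-D series as a single dimension
    lags = [int(lag) for lag in lags]
    max_lag = max(lags)
    n = x.shape[0]
    if min(lags) < 0 or max_lag >= n:
        raise ValueError("lags must be non-negative and shorter than the series")
    # Align all shifted copies so that every row is fully defined.
    columns = [x[max_lag - lag : n - lag] for lag in lags]
    return np.hstack(columns)

# Hypothetical usage: a 1-D ramp embedded with shifts of 0, 1, and 2 steps.
embedded = delay_embed(np.arange(10), lags=(0, 1, 2))
print(embedded.shape)
```

 The embedded rows could then be fed to any learner in place of the raw samples; the first max(lags) samples are dropped because their history is incomplete.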
Ideas that may help to breed more ideas
 It seems that intrinsic points need to band together like stretchy slime in order to produce continuous manifold representations. Can an algorithmic simulated-slime solve the intrinsic fragmentation problems that plague unsupervised inference with neural nets?
 Could the rubber-banding idea be used to refine static item or user profile classifications in recommender system data?
 Can we come up with better ways to make deep network training incremental?
 For some reason, I get better accuracy when I use $$x/\sqrt{x^2+0.1}$$ as an activation function instead of $$x/\sqrt{x^2+1}$$. Why? What does that tell us?
 Can the GAssociative class solve the problem of inputting an image one pixel at-a-time?
 Build a reinforcement learning system that presents video "fantasies" to an oracle to guide an agent's development with visual environments.
 Can latent inputs be used to improve the effectiveness of time-series prediction?
 Can we find a way to demonstrate that my homeostasis model is more effective or useful or practical than objective planning?
 Test whether collaborative filters make better time-series predictors than supervised learners. Specifically, can they make use of multiple layers?
 Forward differencing can reduce polynomial series to a constant. What about forward division? What if we sweep all combinations of forward-differencing or forward-division until we find the leaf node closest to constant?
 Follow through with my manifold blending algorithm.
 Does HyperNEAT essentially do the same thing as weight-sharing in convolutional networks? Wouldn't HyperNEAT be superior?
 How much does it speed up deep network training to use a manifold learning algorithm to initialize the intrinsic vectors?
 Can we use existing NLDR algorithms to guide deep network training? Intuitively, it seems like this should probably work very well.
 Can Manifold Sculpting be used to train deep generative networks? (That is, init to the identity function and scale down the dimensionality of the intrinsics.)
 Can some manifold learning technique be adjusted to form a training method for a stackable autoencoder?
 Is my satisficing approach to k-nn pruning novel?
 If the learning rate is dynamically adapted against the training set and the regularization term is dynamically adapted against a holdout set, does that somehow enable them to be tuned together without stepping on each other?
 Does my proposed method for pruning a k-nn model into a k-nearest rules model work as anticipated?
 Can our neural networks do well at predicting the Bitcoin market? Can they beat recurrent methods?
 Can Mike Smith's hardness metric be combined with Boosting to make a more intelligent booster that decreases weight for likely-noise even when misclassified?
 It seems intuitive that in evolutionary optimization, tournament selection should favor the winner to a greater extent if the two genes are similar. In such cases, choosing between them serves only to seek a local optimum. Alternatively, if gradient-based refinement is one of the operations, then similar genes should immediately kill each other when discovered. Have these notions been recognized yet?
 Does audio compressed with our generalizing version of the Fourier transform exhibit less underwater-effect than audio compressed with the FFT?
 Test idea regarding using integral of logistic function (softplus) as an activation function for the purpose of time-warping in time-series prediction problems. (Published in paper titled "Training Deep Fourier Neural Networks To Fit Time-Series Data")
 I am convinced that softplus makes a better activation function than logistic or other sigmoids. Arguments include: two softplus units can approximate logistic very closely, softplus has only one bend (so undesired bends are not injected), and this builds on existing insights pertaining to rectified linear units. (Published in paper titled "Training Deep Fourier Neural Networks To Fit Time-Series Data")
 Can our Fourier nets algorithm be made incremental? Does that improve results over time, and with larger data sets? (Luke made it incremental, it did indeed improve results, and he is working to publish it now.)
 To handle datasets where the features are not entirely trustworthy, what if we allow the network to update the features during training, but essentially attach them to the given values by "rubber-bands"? Would this facilitate better robustness to noise? How could we learn the rubber-band strength? (Stephen tested this. He determined that it didn't improve much.)
 Can we find a way to dynamically tune weight decay on a per-unit basis, instead of just applying it linearly to all weights?
 Can we publish something about using latent variables as "rubber-bands" to compensate for labels that do not perfectly align with events in time-series data? For example, this might find application in training EEG sensors.
 Test and refine our weight-bleeding algorithm. (Stephen tested this. Improvements turned out to be due to underflow occurring in our baseline. After fixing that, no advantage to bleeding was found.)
 Test whether picking the best model combination really approximates Bayesian model combination as well as picking the best model approximates Bayesian model averaging. (I assigned my data mining class to test this. Results consistently indicated that bombing did not outperform bagging, which suggests that it is not equivalent to BMC. This implies that BMC must be effective for reasons not contained in bombing. Hence, I am not convinced that BMC is yet a well-understood ensemble method.)
 Can we analyze the cutting sound to do materials detection, instead of relying on a spectrometer? (We lost interest in materials cutting.)
 Fix up my Lagrange multiplier training/regularization method for neural nets.
 Is it advantageous to update individual layers of an MLP separately, rather than doing them all together? What if we fully train one layer, then run the data through it before starting on the next layer? Will this improve training speed?
 Test deep nets with an automatic learning rate per layer (train a faster and a slower rate in parallel, and periodically go with the best one).
 Test deep nets with training each layer separately versus together versus separate-for-many-iters.
 How does our method for averaging weights in an MLP work for evolutionary optimization of neural networks?
 We know deep networks are not theoretically invertible, but if real-world manifolds are nearly linear, or at least pseudo-monotonic, they might be practically invertible. We could test this by training a generative MLP on handwritten digits.
 In a linear model, weights are updated as $$w=w+\eta(y-(mx+b))x$$. Note that if $$\eta x$$ is ever greater than 1 or less than -1, then it will overshoot the target. I suspect this explains why linear models (and models with a linear component, such as ReLU) tend to be so unstable. Would clipping the second occurrence of x in the weight update equation to fall between -1 and 1 fix this instability?
 Can a recurrent neural network serve as an encoder for an image by accepting one random pixel at-a-time?
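 For the linear weight-update question above (clipping the second occurrence of x), a toy comparison can be set up as follows. This is a sketch under simplifying assumptions that are not in the notes: only the slope is learned, the intercept b is held fixed at its true value, the data are noiseless, and the names `fit_slope`, `eta`, and `clip_x` are hypothetical. It shows one hand-picked case, not a general result.

```python
def fit_slope(xs, ys, b, eta=0.1, epochs=20, clip_x=False):
    """Per-sample updates w = w + eta * (y - (w*x + b)) * x for a single slope w.

    When clip_x is True, the trailing factor x (the second occurrence of x in
    the update) is clipped to the range [-1, 1]; the prediction still uses x.
    """
    w = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            err = y - (w * x + b)                        # prediction error
            step_x = max(-1.0, min(1.0, x)) if clip_x else x
            w += eta * err * step_x
    return w

# Hypothetical data: y = 2x + 1 with some inputs well outside [-1, 1].
xs = [0.5, -3.0, 5.0, -8.0, 10.0]
ys = [2.0 * x + 1.0 for x in xs]

w_plain = fit_slope(xs, ys, b=1.0, clip_x=False)
w_clipped = fit_slope(xs, ys, b=1.0, clip_x=True)
print(w_plain, w_clipped)
```

 In this one-weight setting each update multiplies the slope error by (1 - eta * x * step_x), so clipping keeps that factor inside [-1, 1] whenever eta * |x| is at most 2, while the unclipped factor can be far outside it. Whether the same holds once several weights and a learned bias interact is the part that would need testing.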
