
A few thoughts about neural networks.

There is real information loss caused by the non-linear activation functions in neural networks.
With no activation function (a purely linear network) there is no information loss, and a deep linear network is equivalent to a shallow one, apart from numeric rounding errors, which can compound over a number of layers.
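
To illustrate the linear case, here is a small numpy sketch (the layer count and sizes are arbitrary) showing that a stack of linear layers collapses to a single matrix, with only rounding error left over:

    import numpy as np

    # A deep *linear* network is just a chain of matrix multiplications,
    # so it collapses to one equivalent matrix (up to rounding error).
    rng = np.random.default_rng(0)
    layers = [rng.standard_normal((16, 16)) for _ in range(10)]

    x = rng.standard_normal(16)

    # Run the input through the deep linear network, layer by layer.
    deep = x.copy()
    for W in layers:
        deep = W @ deep

    # Collapse the whole stack into one matrix and apply it once.
    collapsed = np.linalg.multi_dot(layers[::-1]) @ x

    # The only difference is compounded floating point rounding error.
    print(np.max(np.abs(deep - collapsed)))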

Assuming a significantly non-linear activation function, after a few layers the input information is washed out and the network settles onto a set trajectory where it can no longer extract any further information from the input to make further decisions.  One way to fix that is to give every layer, or every few layers, some weights connecting back to the input data (if that option works with back-propagation.)
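
As a rough illustration of re-injecting the input (my own toy construction, forward pass only; whether it trains well with back-propagation is the open question above), each layer below gets the usual weighted sum of the previous layer plus an extra weighted copy of the raw input:

    import numpy as np

    def forward_with_input_reinjection(x, hidden_layers, input_layers):
        """Forward pass where every layer also receives a weighted copy of
        the raw input x, so deep layers can still read the original data."""
        h = x
        for W_h, W_in in zip(hidden_layers, input_layers):
            # Usual weighted sum of the previous layer, plus a direct
            # weighted connection back to the input data.
            h = np.tanh(W_h @ h + W_in @ x)
        return h

    rng = np.random.default_rng(1)
    dim, depth = 32, 8
    hidden_layers = [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(depth)]
    input_layers  = [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(depth)]

    x = rng.standard_normal(dim)
    print(forward_with_input_reinjection(x, hidden_layers, input_layers)[:4])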

You can also take a chaos theory view of neural nets, where the non-linear behavior of the net compounds (as in compound interest) layer after layer, with the weighted sum operations only able to partially cancel out the non-linearity.
Bifurcations then become decision boundaries separating attractor states. Unwanted (excess) bifurcations would also explain why deep networks are susceptible to small, carefully selected adversarial changes to the input that can cause gross classification errors.
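
A toy experiment (mine, not from any of the papers linked below) that shows the compounding: push two nearly identical inputs through a randomly initialized deep tanh network and track the distance between their hidden states. Depending on the weight scale the trajectories either collapse together (the wash-out above) or fly apart (the chaotic sensitivity that adversarial inputs exploit):

    import numpy as np

    rng = np.random.default_rng(2)
    dim, depth, gain = 64, 20, 2.0   # gain above ~1 pushes the net toward chaos

    layers = [gain * rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(depth)]

    x = rng.standard_normal(dim)
    x_perturbed = x + 1e-6 * rng.standard_normal(dim)  # tiny "adversarial" nudge

    a, b = x, x_perturbed
    for i, W in enumerate(layers):
        a, b = np.tanh(W @ a), np.tanh(W @ b)
        print(f"layer {i:2d}  distance {np.linalg.norm(a - b):.3e}")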

There is also the work of Tishby (https://youtu.be/RKvS958AqGY),
who indicates that a diffusion effect occurs when using SGD.  I understand this to mean that the decision regions grow by diffusion to the maximum extent consistent with the training data, resulting in the good generalization ability of deep nets.

Comments

  1. Another way of looking at the information loss per layer is to say the attractor states cause partial quantization, as mapping (rounding) 1.3384723 to 1 causes a loss of information.

  2. To avoid the impact of adversarial inputs you can use an ensemble of lean, diverse networks. If the input lands on one of the unwanted excess bifurcations in one network, it is unlikely to land on a similar bifurcation in any of the other networks.
    I guess you can use random projections to help you create diverse networks.

  3. You can also see this paper:
    https://arxiv.org/abs/1606.05336

  4. Chaos and neural networks: https://arxiv.org/pdf/1712.08969.pdf

  5. More chaos: https://arxiv.org/abs/1712.09913



Popular posts from this blog

Neural Network Weight Sharing using Random Projections

If you have a weight vector and take multiple different vector random projections of that data, you can use those projections as the weights of a neural network instead.
The price you pay is a Gaussian noise term that limits the numerical precision of your new, enlarged weight set.
However, with the correct training algorithm some of the weights can be very high precision at the expense of making others less precise (higher Gaussian noise.)
Vector random projections can be invertible if your training algorithm needs that (it probably will, unless you are using evolution.)
You can also use the same idea for other algorithms that could benefit from variable precision parameters.
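
A small numpy sketch of the idea. I am assuming the fast random projection is a random sign flip followed by a Walsh Hadamard transform, which is roughly what the linked NativeJavaWHT code does; all the names here are mine:

    import numpy as np

    def wht(v):
        """Fast Walsh Hadamard transform (length must be a power of 2),
        normalized so the transform is orthonormal (and hence self-inverse)."""
        v = v.copy()
        n = len(v)
        h = 1
        while h < n:
            for i in range(0, n, 2 * h):
                for j in range(i, i + h):
                    a, b = v[j], v[j + h]
                    v[j], v[j + h] = a + b, a - b
            h *= 2
        return v / np.sqrt(n)

    def random_projection(v, signs):
        """Fast random projection: random sign flip followed by a WHT."""
        return wht(v * signs)

    # Expand one base weight vector into several shared weight vectors,
    # one per random sign pattern.
    rng = np.random.default_rng(3)
    dim = 64                      # must be a power of 2 for this WHT
    base_weights = rng.standard_normal(dim)

    sign_patterns = [rng.choice([-1.0, 1.0], size=dim) for _ in range(4)]
    shared_weights = [random_projection(base_weights, s) for s in sign_patterns]

    # Each projected vector can now serve as a weight vector in its own layer.
    print([w[:3] for w in shared_weights])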

Fast random projection code:
https://github.com/S6Regen/NativeJavaWHT
You can create an inverse random projection by changing the order of the operations in the random projection code.
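
For example, reusing the helpers from the sketch above, reversing the order of operations means applying the (self-inverse, orthonormal) WHT first and then undoing the sign flip. This is my reading of "changing the order of the operations", not the exact Java code:

    def inverse_random_projection(p, signs):
        """Undo the projection by reversing the steps: the orthonormal WHT
        is its own inverse, and a +/-1 sign flip undoes itself."""
        return wht(p) * signs

    projected = random_projection(base_weights, sign_patterns[0])
    recovered = inverse_random_projection(projected, sign_patterns[0])
    print(np.max(np.abs(recovered - base_weights)))   # ~1e-15, i.e. fully recovered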

Double weighting for neural networks

The weight and sum operation for a neural network is a dot product.
The Walsh Hadamard transform is a collection of dot product operations.  
The Walsh Hadamard transform connects every single input point to the entirety of output points. The weighted sum of a number of dot products is still a dot product.

The idea is to weight the inputs to n Walsh Hadamard transforms and then weight their outputs.  After running the input vector through the n double-weighted transforms you sum together each of the corresponding dimensions and use that as the input to the neuron activation function.  Thus each neuron accounts for 2n weight parameters.  The number of neurons is the order of the transform. That makes the network fully connected on a layer basis with only a limited number of weights.  It should also allow the network to pick out regularities that might otherwise require time-consuming correlation operations.
If wi are weight vectors and WHT is the transform and x the input then sa…
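
A minimal sketch of one double-weighted layer as I read the description, reusing the wht helper from the weight-sharing sketch above (the tanh activation and the names are my choices, not from the post):

    import numpy as np

    def double_weighted_wht_layer(x, in_weights, out_weights):
        """One layer: for each of the n transforms, weight the input
        (element-wise), apply the WHT, weight the output (element-wise),
        then sum over the n transforms and apply the activation.
        2n weight vectors in total."""
        acc = np.zeros_like(x)
        for a, b in zip(in_weights, out_weights):
            acc += b * wht(a * x)          # wht() from the earlier sketch
        return np.tanh(acc)

    rng = np.random.default_rng(4)
    dim, n = 64, 3          # dim = order of the transform = number of neurons
    in_weights  = [rng.standard_normal(dim) for _ in range(n)]
    out_weights = [rng.standard_normal(dim) for _ in range(n)]

    x = rng.standard_normal(dim)
    print(double_weighted_wht_layer(x, in_weights, out_weights)[:4])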

Probing deep neural networks

Probing randomly initialized deep neural networks with 2D variations. Depth=number of layers.
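
I take "probing with 2D variations" to mean sweeping the input along two fixed directions and recording the network output over that 2D grid; a rough guess at such a probe (the details are mine, not taken from the post):

    import numpy as np

    rng = np.random.default_rng(5)
    dim, depth = 32, 12
    layers = [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(depth)]
    readout = rng.standard_normal(dim)

    # Two fixed random directions in input space define the 2D slice to probe.
    d1, d2 = rng.standard_normal(dim), rng.standard_normal(dim)

    def net(x):
        for W in layers:
            x = np.tanh(W @ x)
        return readout @ x

    # Scalar output over a grid of 2D variations of the input.
    grid = np.linspace(-2.0, 2.0, 50)
    image = np.array([[net(u * d1 + v * d2) for u in grid] for v in grid])
    print(image.shape)   # 50 x 50 map, e.g. for plotting with matplotlib's imshow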