There is real information loss caused by the non-linear activation functions in neural networks:

With no activation function (a linear network) there is no information loss, and a deep linear network is equivalent to a shallow one, apart, that is, from numeric rounding errors, which can compound over a number of layers.

Assuming a significantly non-linear activation function, after a few layers the input information is washed out and the network settles onto a set trajectory where it is no longer able to extract any further information from the input to make any further decisions. One way to fix that is to have, every layer or every few layers, some weights connecting back to the input data (if that option works with back-propagation).
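A minimal sketch of that idea, assuming a toy numpy forward pass where every few layers the raw input is concatenated back in (the layer sizes, re-injection interval, and initialization are all illustrative; concatenation is differentiable, so back-propagation passes through it without trouble):

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, depth=8, width=32, reinject_every=2):
    """Toy MLP that concatenates the raw input back in every few layers."""
    h = x
    for layer in range(depth):
        if layer > 0 and layer % reinject_every == 0:
            h = np.concatenate([h, x])       # re-inject the raw input
        W = rng.normal(0, 1.0 / np.sqrt(h.size), size=(width, h.size))
        h = np.tanh(W @ h)                   # non-linear layer
    return h

x = rng.normal(size=4)
print(forward(x).shape)  # (32,)
```

The re-injection gives later layers a fresh copy of the input, so the washed-out signal can be re-read rather than reconstructed from the degraded hidden state.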

You can take a chaos-theory view of neural nets, in which the non-linear behavior of the net compounds (as in compound interest) layer after layer, with the weighted-sum operations only able to partially cancel out the non-linear behavior.
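A toy numpy demonstration of that compounding, using the standard mean-field setup for random tanh networks (the width, depth, and gain values are illustrative; gain > 1 is the regime usually described as chaotic):

```python
import numpy as np

rng = np.random.default_rng(1)
width, depth, gain = 64, 30, 2.0   # gain > 1 puts random tanh nets in the chaotic regime

# Fixed random layers; weight scale gain/sqrt(width) is the usual mean-field choice
Ws = [rng.normal(0.0, gain / np.sqrt(width), size=(width, width)) for _ in range(depth)]

x = rng.normal(size=width)
y = x + 1e-6 * rng.normal(size=width)   # tiny perturbation of the input
d0 = np.linalg.norm(x - y)

for W in Ws:
    x, y = np.tanh(W @ x), np.tanh(W @ y)

print(np.linalg.norm(x - y) / d0)  # the perturbation is amplified many-fold after 30 layers
```

Layer after layer, the two trajectories separate, which is exactly the compound-interest picture: each non-linear layer multiplies the existing divergence rather than merely adding to it.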

Bifurcations then become decision boundaries separating attractor states. The existence of unwanted (excess) bifurcations would also explain why deep networks are susceptible to small, carefully selected adversarial changes to the input that can cause gross classification errors.

There is also the work of Tishby (https://youtu.be/RKvS958AqGY), who indicates that when using SGD a diffusion effect occurs. I understand this to mean that the decision regions grow, by diffusion, to the maximum extent consistent with the training data, resulting in the good generalization ability of deep nets.

Another way of looking at the information loss per layer is to say the attractor states cause a partial quantization, since mapping (rounding) 1.3384723 to 1 causes a loss of information.
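That quantization view can be made concrete: rounding is a many-to-one map, so distinct inputs become indistinguishable downstream (a toy numpy illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
xs = rng.uniform(0.0, 3.0, size=1000)   # 1000 distinct input values
qs = np.round(xs)                       # all collapse onto at most four integers

print(len(np.unique(xs)), "->", len(np.unique(qs)))  # many-to-one: the inputs are unrecoverable
```

Once the values have collapsed, no later layer can tell the original inputs apart, which is the information loss in question.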

To avoid the impact of adversarial inputs you can use an ensemble of lean, diverse networks. If the input lands on one of the unwanted excess bifurcations in one network, it is unlikely to land on a similar bifurcation in any of the other networks.
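A minimal sketch of that ensemble idea, with untrained random-weight nets standing in for real trained ones (the dimensions, number of members, and majority-vote rule are all illustrative):

```python
import numpy as np

def make_net(seed, in_dim=16, hidden=32, classes=3):
    """One lean classifier with random (untrained) weights, standing in for a trained net."""
    r = np.random.default_rng(seed)
    W1 = r.normal(0, 1 / np.sqrt(in_dim), size=(hidden, in_dim))
    W2 = r.normal(0, 1 / np.sqrt(hidden), size=(classes, hidden))
    return lambda x: int(np.argmax(W2 @ np.tanh(W1 @ x)))

nets = [make_net(s) for s in range(7)]      # seven diverse networks

def ensemble_predict(x):
    votes = np.bincount([net(x) for net in nets], minlength=3)
    return int(np.argmax(votes))            # majority vote; one fooled net is outvoted

x = np.random.default_rng(3).normal(size=16)
print(ensemble_predict(x))                  # a class label in {0, 1, 2}
```

An adversarial input that exploits an excess bifurcation in one member flips one vote; as long as the members' bifurcations do not coincide, the majority is unaffected.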

I guess you can use random projections to help you create diverse networks.
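For example, giving each ensemble member its own fixed random projection of the input is one simple way to enforce diversity (the dimensions here are illustrative):

```python
import numpy as np

def random_projection(seed, in_dim=64, proj_dim=32):
    """A fixed random projection matrix; each ensemble member gets its own."""
    r = np.random.default_rng(seed)
    return r.normal(0, 1 / np.sqrt(in_dim), size=(proj_dim, in_dim))

projections = [random_projection(s) for s in range(5)]

x = np.random.default_rng(4).normal(size=64)
views = [P @ x for P in projections]        # five different views of the same input
print([v.shape for v in views])             # [(32,), (32,), (32,), (32,), (32,)]
```

Because each member sees the input through a different linear map, an adversarial perturbation crafted against one view is unlikely to line up with the sensitive directions of the others.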

You can also see this paper:

https://arxiv.org/abs/1606.05336

Chaos and neural networks: https://arxiv.org/pdf/1712.08969.pdf

More chaos: https://arxiv.org/abs/1712.09913