There is real information loss caused by the non-linear activation functions in neural networks:
With no activation function (a purely linear network) there is no such loss, and a deep linear network is equivalent to a shallow one. Apart, that is, from numeric rounding errors, which can compound over many layers.
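To make the equivalence concrete, here is a minimal NumPy sketch. The layer sizes, the random weights and the helper names are my own assumptions for illustration, nothing more. It pushes an input through five linear layers one at a time, then through the single matrix obtained by multiplying the five together, and compares the two results.

```python
import numpy as np

def forward_linear(weights, x):
    """Apply each weight matrix in turn, with no activation function."""
    for w in weights:
        x = w @ x
    return x

def collapse_linear_layers(weights):
    """Multiply all layer matrices into one equivalent shallow matrix."""
    combined = np.eye(weights[0].shape[1])
    for w in weights:
        combined = w @ combined
    return combined

rng = np.random.default_rng(0)
# Five hypothetical 8x8 layers, scaled so activations stay around unit size.
weights = [rng.standard_normal((8, 8)) / np.sqrt(8) for _ in range(5)]
x = rng.standard_normal(8)

deep_out = forward_linear(weights, x)
shallow_out = collapse_linear_layers(weights) @ x

# Any remaining difference is floating-point rounding only.
max_diff = np.max(np.abs(deep_out - shallow_out))
print(max_diff)
```

The two outputs should agree up to rounding, which is the only discrepancy mentioned above.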
Assuming a significantly non-linear activation function, after a few layers the input information is washed out and the network settles onto a set trajectory, where it can no longer extract any further information from the input to make further decisions. One way to fix that is to have, every layer or every few layers, some weights connecting back to the input data (if that option works with back-propagation).
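A rough sketch of what connecting back to the input could look like, under assumptions that are entirely mine: tanh layers, random untrained weights, and recurrent weights deliberately scaled to spectral norm 0.5 so that the wash-out is guaranteed in this toy setting. It is not a claim about how any particular trained network behaves. Two different inputs are pushed through 30 layers, once without and once with per-layer weights on the original input.

```python
import numpy as np

def scaled_to_spectral_norm(matrix, target):
    """Rescale a matrix so its largest singular value equals `target`."""
    return matrix * (target / np.linalg.norm(matrix, 2))

def run_layers(x, weights, input_weights=None):
    """tanh layers; optionally every layer also sees the original input x."""
    h = x
    for k, w in enumerate(weights):
        pre = w @ h
        if input_weights is not None:
            pre = pre + input_weights[k] @ x  # weights connecting back to the input
        h = np.tanh(pre)
    return h

rng = np.random.default_rng(0)
width, depth = 8, 30  # hypothetical sizes
weights = [scaled_to_spectral_norm(rng.standard_normal((width, width)), 0.5)
           for _ in range(depth)]
input_weights = [rng.standard_normal((width, width)) / np.sqrt(width)
                 for _ in range(depth)]
x_a = rng.standard_normal(width)
x_b = rng.standard_normal(width)

# Distance between the final activations for the two inputs.
gap_plain = np.linalg.norm(run_layers(x_a, weights) - run_layers(x_b, weights))
gap_reinjected = np.linalg.norm(run_layers(x_a, weights, input_weights)
                                - run_layers(x_b, weights, input_weights))
print(gap_plain, gap_reinjected)
```

Without the input connections the two inputs end up practically indistinguishable at the last layer. With them, the last layer still has direct access to the differences between the inputs.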
You can also take a chaos-theory view of neural nets, in which the non-linear behavior of the net compounds (as in compound interest) layer after layer, with the weighted-sum operations only able to partially cancel it out.
Bifurcations then become decision boundaries separating attractor states. The existence of unwanted (excess) bifurcations would also explain why deep networks are susceptible to small (carefully selected) adversarial changes to the input that can cause gross classification errors.
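The compounding can be seen in a toy experiment. This is only a sketch under my own assumptions: a random untrained tanh network with deliberately large weights (gain 3), and a tiny random perturbation rather than a carefully selected adversarial one. It tracks how far apart two almost identical inputs drift, layer by layer.

```python
import numpy as np

def layer_distances(x, perturbation, weights):
    """Distance between the activations of x and x + perturbation at each layer."""
    a, b = x, x + perturbation
    distances = [np.linalg.norm(a - b)]
    for w in weights:
        a, b = np.tanh(w @ a), np.tanh(w @ b)
        distances.append(np.linalg.norm(a - b))
    return distances

rng = np.random.default_rng(0)
width, depth, gain = 64, 50, 3.0  # hypothetical values; large gain = strongly non-linear regime
weights = [gain * rng.standard_normal((width, width)) / np.sqrt(width)
           for _ in range(depth)]
x = rng.standard_normal(width)
perturbation = 1e-6 * rng.standard_normal(width)

distances = layer_distances(x, perturbation, weights)
print(distances[0], distances[-1])
```

In this regime the gap typically grows by a roughly constant factor per layer until it saturates, which is the compound-interest behavior described above. With a small gain the same code shows the gap shrinking instead.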
There is also the work of Tishby (https://youtu.be/RKvS958AqGY), who indicates that when using SGD a diffusion effect occurs. I understand this to mean that the decision regions grow to the maximum extent possible consistent with the training data by diffusion, resulting in the good generalization ability of deep nets.