Numerical Stability and Initialization#
Configuration#
from dataclasses import asdict, dataclass, field

@dataclass
class SeedAll:
    seed: int = field(default=1992, metadata={"help": "Seed for reproducibility"})
    seed_torch: bool = field(default=True, metadata={"help": "Seed torch for reproducibility"})

@dataclass
class Config:
    seed_all: SeedAll = field(default_factory=SeedAll)

config = Config()
import os
import random
from pprint import pprint
from typing import Optional

import numpy as np
import torch

def seed_all(seed: Optional[int] = 1992, seed_torch: bool = True) -> int:
    """
    Seed all random number generators.

    Parameters
    ----------
    seed : int, optional
        Seed number to be used, by default 1992.
    seed_torch : bool, optional
        Whether to seed PyTorch or not, by default True.
    """
    # fmt: off
    os.environ["PYTHONHASHSEED"] = str(seed)  # set PYTHONHASHSEED env var at fixed value
    np.random.seed(seed)                      # numpy pseudo-random generator
    random.seed(seed)                         # python's built-in pseudo-random generator

    if seed_torch:
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)      # torch.manual_seed may call manual_seed_all, but calling it again here
                                              # to make sure it gets called at least once
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
        torch.backends.cudnn.enabled = False
    # fmt: on
    return seed

_ = seed_all(**asdict(config.seed_all))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pprint(device)
Notations and Definitions#
\(\mathbf{X} \in \mathbb{R}^{n \times d}\) is a minibatch of \(n\) samples with \(d\) features each.
\(\mathbf{H} \in \mathbb{R}^{n \times h}\) is the output of the hidden layer (hidden representations).
\(\mathbf{W}^{(1)} \in \mathbb{R}^{d \times h}\) and \(\mathbf{b}^{(1)} \in \mathbb{R}^{1 \times h}\) are the weights and biases for the hidden layer.
\(\mathbf{W}^{(2)} \in \mathbb{R}^{h \times q}\) and \(\mathbf{b}^{(2)} \in \mathbb{R}^{1 \times q}\) are the weights and biases for the output layer.
\(\sigma(\cdot)\) is the activation function applied element-wise.
Note in particular we denote the bias vector as a row vector because it can be broadcasted to the shape of the hidden layer output. For example, \(\mathbf{X}\mathbf{W}^{(1)} \in \mathbb{R}^{n \times h}\) and \(\mathbf{b}^{(1)} \in \mathbb{R}^{1 \times h}\), so \(\mathbf{X}\mathbf{W}^{(1)} + \mathbf{b}^{(1)}\) is a valid operation as the bias vector is broadcasted to the shape of \(\mathbf{X}\mathbf{W}^{(1)}\). More concretely, if \(\mathbf{b}^{(1)} = [b_1, b_2, \ldots, b_h]\), then \(\mathbf{X}\mathbf{W}^{(1)} + \mathbf{b}^{(1)}\) is equivalent to \(\mathbf{X}\mathbf{W}^{(1)} + \begin{bmatrix} b_1 & b_2 & \cdots & b_h \end{bmatrix}\).
The bias term in a neural network layer typically has a shape that matches the number of units in that layer, not the number of samples. In the context of a fully connected layer, if the layer has \(h\) hidden units, then the bias term will have a shape of \((h,)\).
When we perform the forward pass, the bias term is broadcasted to match the shape of the output. For example, if we have an input \(\mathbf{X}\) with shape \((n, d)\) and a weight matrix \(\mathbf{W}^{(1)}\) with shape \((d, h)\), the output before activation will be \(\mathbf{X} \mathbf{W}^{(1)}\) with shape \((n, h)\). The bias \(\mathbf{b}^{(1)}\) with shape \((h,)\) will be broadcasted to \((n, h)\) during the addition operation.
So, in the equation \(\mathbf{H} = \sigma(\mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)})\), the \(\mathbf{b}^{(1)}\) term is implicitly broadcasted to have the same shape as \(\mathbf{X} \mathbf{W}^{(1)}\), which is \((n, h)\).
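This broadcasting behaviour is easy to check directly. Below is a minimal sketch using assumed sizes \(n = 2\), \(d = 4\), \(h = 3\); the tensors are random and purely illustrative.

```python
import torch

n, d, h = 2, 4, 3                      # assumed sizes for illustration
X = torch.randn(n, d)                  # minibatch of n samples, d features
W1 = torch.randn(d, h)                 # hidden-layer weights
b1 = torch.randn(h)                    # bias with shape (h,)

Z = X @ W1 + b1                        # b1 is broadcast from (h,) to (n, h)
print(Z.shape)                         # torch.Size([2, 3])

# Broadcasting is equivalent to explicitly tiling the bias across rows.
Z_explicit = X @ W1 + b1.expand(n, h)
assert torch.allclose(Z, Z_explicit)
```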
Neural Networks are Permutation Symmetrical#
Permutation symmetry is a mathematical property that has intriguing implications for the design and training of neural networks. In this section, we’ll explore what permutation symmetry is, provide an example to illustrate the concept, and then discuss its specific impact on neural network architectures. We’ll also delve into the challenges this symmetry presents during the training process and examine some strategies to mitigate them.
What is Permutation Symmetry?#
In mathematics, permutation symmetry refers to a property of a function or equation that remains invariant under permutations of its variables. More formally, a function \(f\) is said to possess permutation symmetry if

\[
f(x_1, x_2, \ldots, x_n) = f\left(x_{\pi(1)}, x_{\pi(2)}, \ldots, x_{\pi(n)}\right)
\]

for any permutation \(\pi\) of the indices \(\{1, 2, \ldots, n\}\).
Example: Summation Function#
Consider a function \(f\) defined as:

\[
f(\mathbf{x}) = \sum_{i=1}^{n} x_i
\]

where \(\mathbf{x} = [x_1, x_2, \ldots, x_n]\).

This function exhibits permutation symmetry because the sum remains the same regardless of the order of the elements in \(\mathbf{x}\). Formally, for any permutation \(\pi\) of the indices \(\{1, 2, \ldots, n\}\), we have:

\[
f\left(x_{\pi(1)}, x_{\pi(2)}, \ldots, x_{\pi(n)}\right) = \sum_{i=1}^{n} x_{\pi(i)} = \sum_{i=1}^{n} x_i = f(\mathbf{x}).
\]
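As a quick numerical sanity check (the vector and the permutation below are arbitrary choices, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
pi = rng.permutation(len(x))           # a random permutation of the indices

# The sum is invariant under any permutation of the inputs.
assert np.isclose(x.sum(), x[pi].sum())
```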
Permutation Matrix#
To understand the permutation symmetry in neural networks more concretely, it’s useful to introduce the concept of a permutation matrix. It is a square binary matrix that has exactly one entry of 1 in each row and each column and 0s elsewhere. When we multiply another matrix by a permutation matrix, the effect is to permute the rows or columns of the matrix. For example, consider the following permutation matrix \(\mathbf{P}\):
If we multiply this matrix with another matrix \(\mathbf{A}\), it will swap the first and second rows of \(\mathbf{A}\). Specifically, \(\mathbf{P} \mathbf{A}\) will permute the rows of \(\mathbf{A}\), and \(\mathbf{A} \mathbf{P}\) will permute the columns of \(\mathbf{A}\).
For example, let \(\mathbf{A}\) be:
The product \(\mathbf{P} \mathbf{A}\) would be:
As we can see, the first and second rows of \(\mathbf{A}\) have been swapped.
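The row and column cases are easy to see numerically. Below is a minimal sketch with an assumed \(3 \times 3\) permutation matrix that swaps the first two positions and an arbitrary matrix \(\mathbf{A}\):

```python
import numpy as np

# Permutation matrix that swaps the first and second positions.
P = np.array([[0, 1, 0],
              [1, 0, 0],
              [0, 0, 1]])

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

print(P @ A)   # rows 1 and 2 of A are swapped
print(A @ P)   # columns 1 and 2 of A are swapped
```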
Reverse Operation#
If you have a permutation matrix and you multiply it by a tensor to obtain a permuted tensor, then the operation can be reversed by multiplying the permuted tensor with the inverse of the permutation matrix.
However, the beauty of permutation matrices is that their inverse is simply their transpose. Thus, to reverse the operation, you can multiply the permuted tensor with the transpose of the permutation matrix.
If \(\mathbf{P}\) is your permutation matrix and \(\mathbf{A}\) is your tensor, then the permuted tensor is \(\mathbf{P}\mathbf{A}\). To reverse the operation on \(\mathbf{P}\mathbf{A}\), let’s first denote \(\mathbf{P}\mathbf{A}\) as \(\mathbf{B}\) for easier manipulation; then we can do the following:

\[
\mathbf{P}^{-1} \mathbf{B} = \mathbf{P}^{-1} \mathbf{P} \mathbf{A} = \mathbf{A},
\]

where we simply left-multiply \(\mathbf{B}\) by the inverse of \(\mathbf{P}\) to get back \(\mathbf{A}\). However, the inverse of \(\mathbf{P}\) is simply its transpose, so we can rewrite the above as:

\[
\mathbf{P}^T \mathbf{B} = \mathbf{P}^T \mathbf{P} \mathbf{A} = \mathbf{A}.
\]
This potentially saves us a lot of computation time because we don’t need to compute the inverse of \(\mathbf{P}\).
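A short sketch confirming that the transpose undoes the permutation (reusing the assumed \(\mathbf{P}\) and \(\mathbf{A}\) from the snippet above):

```python
import numpy as np

P = np.array([[0, 1, 0],
              [1, 0, 0],
              [0, 0, 1]])
A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

B = P @ A                                   # permute the rows of A
assert np.array_equal(P.T @ B, A)           # P^T reverses the permutation
assert np.allclose(np.linalg.inv(P), P.T)   # the inverse of P is its transpose
```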
Permutation Symmetry in Neural Networks#
Motivation#
The following two paragraphs are quoted from section 5.4. Numerical Stability and Initialization from Dive Into Deep Learning [Zhang et al., 2021]:
Another problem in neural network design is the symmetry inherent in their parametrization. Assume that we have a simple MLP with one hidden layer and two units. In this case, we could permute the weights \(\mathbf{W}^{(1)}\) of the first layer and likewise permute the weights of the output layer to obtain the same function. There is nothing special differentiating the first and second hidden units. In other words, we have permutation symmetry among the hidden units of each layer.
This is more than just a theoretical nuisance. Consider the aforementioned one-hidden-layer MLP with two hidden units. For illustration, suppose that the output layer transforms the two hidden units into only one output unit. Imagine what would happen if we initialized all the parameters of the hidden layer as \(\mathbf{W}^{(1)}=c\) for some constant \(c\). In this case, during forward propagation either hidden unit takes the same inputs and parameters producing the same activation which is fed to the output unit. During backpropagation, differentiating the output unit with respect to parameters \(\mathbf{W}^{(1)}\) gives a gradient all of whose elements take the same value. Thus, after gradient-based iteration (e.g., minibatch stochastic gradient descent), all the elements of \(\mathbf{W}^{(1)}\) still take the same value. Such iterations would never break the symmetry on their own and we might never be able to realize the network’s expressive power. The hidden layer would behave as if it had only a single unit. Note that while minibatch stochastic gradient descent would not break this symmetry, dropout regularization (to be introduced later) would!
To understand the phenomenon of constant weight initialization, let’s first consider the concept of permutation symmetry in neural networks. We will illustrate the concept with a simple example, and then discuss its implications for neural network design and training.
Running Example#
Consider a one-hidden-layer neural network (multilayer perceptron) with \(h\) hidden units. The forward pass is given by the following equations:

\[
\mathbf{H} = \sigma\left(\mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)}\right) \qquad (1.1)
\]

\[
\mathbf{O} = \mathbf{H} \mathbf{W}^{(2)} + \mathbf{b}^{(2)} \qquad (1.2)
\]

where

Equation \(1.1\) gives the hidden layer activations \(\mathbf{H}\).
Equation \(1.2\) gives the output layer activations \(\mathbf{O}\).

It is also worth noting that we often denote the pre-activation matrix as \(\mathbf{Z}\), so the equations above can be rewritten as:

\[
\mathbf{Z}^{(1)} = \mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)}, \qquad \mathbf{H} = \sigma\left(\mathbf{Z}^{(1)}\right), \qquad \mathbf{Z}^{(2)} = \mathbf{H} \mathbf{W}^{(2)} + \mathbf{b}^{(2)}, \qquad \mathbf{O} = \mathbf{Z}^{(2)}.
\]
Let’s denote the weight matrix for the first layer as \(\mathbf{W}^{(1)} \in \mathbb{R}^{d \times h}\) and the weight matrix for the output layer as \(\mathbf{W}^{(2)} \in \mathbb{R}^{h \times q}\).
To see how permutation symmetry manifests in the one-hidden-layer neural network, let’s consider a simple case:
Input Matrix#
Denote \(\mathbf{X}\) as the input matrix, containing \(n=2\) samples with \(d=2\) individual features each. Let’s say the input matrix is:

\[
\mathbf{X} = \begin{bmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \end{bmatrix} = \begin{bmatrix} \color{green}{1} & \color{green}{2} \\ \color{blue}{3} & \color{blue}{4} \end{bmatrix}
\]
where
Green: Represents the first sample \(\begin{bmatrix} \color{green}{1} & \color{green}{2} \end{bmatrix}\).
Blue: Represents the second sample \(\begin{bmatrix} \color{blue}{3} & \color{blue}{4} \end{bmatrix}\).
\(x_{ij}\): Represents the \(j\)-th feature of the \(i\)-th sample.
Weight and Bias Matrices#
For the hidden layer:
Orange, Purple, Brown: Represent the first, second, and third hidden neurons respectively.
Note in particular, each column of \(\mathbf{W}^{(1)}\) represents the weights for a single hidden neuron.
For the output layer:
Pink, Cyan: Represent columns in \(\mathbf{W}^{(2)}\) (and thus output neurons in the output layer).
Note in particular, each column of \(\mathbf{W}^{(2)}\) represents the weights for a single output neuron.
Forward Propagation of the Output Layer#
Given the hidden representation \(\mathbf{H}\):
And the weight matrix \(\mathbf{W}^{(2)}\) for the output layer:
The bias \(\mathbf{b}^{(2)}\) is:
Then, the output \(\mathbf{O}\) can be calculated as:
In the output layer, each hidden representation \(\mathbf{h}_i\) is mapped to an output \(\mathbf{o}_i\). This is accomplished by matrix multiplication with the weight matrix \(\mathbf{W}^{(2)}\) and addition of the bias \(\mathbf{b}^{(2)}\).
If we break it down, each output neuron calculates a weighted sum of all the features in the hidden representation \(\mathbf{h}_i\), adds a bias, and possibly applies an activation function (if we’re not in the final layer or if a specific activation function is desired for the output).
This setup allows the network to learn complex mappings from the hidden representation to the output, effectively enabling the model to make predictions or decisions based on the transformed input. Just like each hidden neuron captures different aspects of the input, each output neuron can capture different aspects of the hidden representation, making the network flexible and capable of modeling complex relationships.
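To make the forward pass concrete, here is a minimal numerical sketch. The input \(\mathbf{X}\) follows the example above, while the weights and biases are assumed illustrative values (the original ones are not reproduced here), with \(d = 2\), \(h = 3\), and \(q = 2\):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

X = np.array([[1.0, 2.0],          # first sample
              [3.0, 4.0]])         # second sample

# Assumed illustrative parameters: each column of W1 is one hidden neuron.
W1 = np.array([[0.1, 0.2, 0.3],
               [0.4, 0.5, 0.6]])
b1 = np.array([0.01, 0.02, 0.03])

# Each column of W2 is one output neuron.
W2 = np.array([[0.7, 0.8],
               [0.9, 1.0],
               [1.1, 1.2]])
b2 = np.array([0.04, 0.05])

H = sigmoid(X @ W1 + b1)           # hidden representations, shape (2, 3)
O = H @ W2 + b2                    # outputs, shape (2, 2)
print(H)
print(O)
```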
Permuting the Weights#
Let’s permute \(\mathbf{W}^{(1)}\) and \(\mathbf{W}^{(2)}\) in a consistent manner. What is meant by consistent here? It means that we apply the same permutation along the dimension that both matrices share. For instance, \(\mathbf{W}^{(1)}\) and \(\mathbf{W}^{(2)}\) share the same index \(h\), so we permute the columns of \(\mathbf{W}^{(1)}\) and the rows of \(\mathbf{W}^{(2)}\) in the same way.
(Permutation Consistency)
The permutation operation must be consistent across layers that share the same index or dimension. In the case of a one-hidden-layer MLP, the shared index is \(h\), which is the number of hidden units.
When we permute the columns of \(\mathbf{W}^{(1)}\), we are effectively rearranging the hidden units. To maintain consistency, we must apply the same permutation to the rows of \(\mathbf{W}^{(2)}\) because both matrices share the \(h\) index, which corresponds to these hidden units.
However, permuting the rows of \(\mathbf{W}^{(1)}\) would mean rearranging the input features, which have the dimension \(d\). Since \(\mathbf{W}^{(2)}\) is of shape \(h \times q\), it doesn’t have a \(d\) dimension to permute in a corresponding manner. Therefore, an arbitrary permutation of the rows of \(\mathbf{W}^{(1)}\) would not have a corresponding permissible operation on \(\mathbf{W}^{(2)}\), breaking the symmetry.
Consider a permutation matrix \(\mathbf{P} \in \mathbb{R}^{h \times h}\) such that \(\mathbf{P} \mathbf{P}^T = \mathbf{I}\), where \(\mathbf{I}\) is the identity matrix. Then, the permuted weight matrices are given by:

\[
\mathbf{W}^{(1)}_{\text{perm}} = \mathbf{W}^{(1)} \mathbf{P}, \qquad \mathbf{W}^{(2)}_{\text{perm}} = \mathbf{P}^T \mathbf{W}^{(2)}.
\]
We will see later why there is no need to permute the bias vector \(\mathbf{b}^{(2)}\).
(Example: Permuting the Weights)
Let’s assume \(\mathbf{P}\) is a permutation matrix that simply swaps the first and second columns (or rows). This is a simplified example to make the calculations more tractable. The permutation matrix \(\mathbf{P}\) would then look like:

\[
\mathbf{P} = \begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\]
With this permutation matrix, we can find the permuted weight matrices \(\mathbf{W}^{(1)}_{\text{perm}}\) and \(\mathbf{W}^{(2)}_{\text{perm}}\) as follows:
Notice that in \(\mathbf{W}^{(1)}_{\text{perm}}\), the first and second columns have been swapped, and similarly for the rows in \(\mathbf{W}^{(2)}_{\text{perm}}\).
The bias \(\mathbf{b}^{(1)}\) for the hidden layer can also be permuted in the same manner:

\[
\mathbf{b}^{(1)}_{\text{perm}} = \mathbf{b}^{(1)} \mathbf{P}.
\]
We noted earlier that there is no need to permute \(\mathbf{b}^{(2)}\), so it remains unchanged.
With these permuted weights and biases, we can proceed to perform the forward pass and backpropagation using the same methodology as with the original weights and biases.
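Continuing the numerical sketch from above (with the same assumed parameter values), the consistent permutation can be written in a few lines:

```python
import numpy as np

# Permutation matrix over the h = 3 hidden units: swap the first two.
P = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])

W1 = np.array([[0.1, 0.2, 0.3],
               [0.4, 0.5, 0.6]])
b1 = np.array([0.01, 0.02, 0.03])
W2 = np.array([[0.7, 0.8],
               [0.9, 1.0],
               [1.1, 1.2]])

W1_perm = W1 @ P        # permute the columns of W1 (the hidden units)
b1_perm = b1 @ P        # permute the hidden-layer bias the same way
W2_perm = P.T @ W2      # permute the rows of W2 consistently

print(W1_perm)          # columns 1 and 2 of W1 swapped
print(W2_perm)          # rows 1 and 2 of W2 swapped
```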
Forward Propagation with Permuted Weights#
Forward Propagation of the Hidden Layer#
Hidden layer activations with permuted weights:

\[\begin{split}
\begin{aligned}
\mathbf{H}_{\text{perm}} &= \sigma\left(\mathbf{X} \mathbf{W}^{(1)}_{\text{perm}} + \mathbf{b}^{(1)}_{\text{perm}}\right) \\
&= \sigma\left(\mathbf{X} \mathbf{W}^{(1)} \mathbf{P} + \mathbf{b}^{(1)} \mathbf{P}\right) \\
&= \sigma\left(\left(\mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)}\right) \mathbf{P}\right) \\
&= \sigma\left(\mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)}\right) \mathbf{P} \\
&= \mathbf{H} \mathbf{P} \qquad (1.3)
\end{aligned}
\end{split}\]

It may seem like we did something illegal here: we pulled the permutation matrix \(\mathbf{P}\) outside the activation function \(\sigma\) even though \(\sigma\) can be non-linear. There are two reasons we can do that:

\(\sigma\) is applied element-wise, so it preserves the shape of the matrix and;

\(\mathbf{P}\) is a permutation matrix. Let’s call \(\mathbf{Z} = \mathbf{X}\mathbf{W}^{(1)} + \mathbf{b}^{(1)}\). What we are claiming is that \(\sigma(\mathbf{Z}) \mathbf{P} = \sigma(\mathbf{Z} \mathbf{P})\). Right-multiplying \(\mathbf{Z}\) by \(\mathbf{P}\) is just a permutation of the columns of \(\mathbf{Z}\). This means that if we apply \(\sigma\) to \(\mathbf{Z}\) and then permute the columns of the result, we get the same result as if we first permuted the columns of \(\mathbf{Z}\) with \(\mathbf{P}\) and then applied \(\sigma\).
import numpy as np

# Define the sigmoid activation function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Define a sample matrix Z (3x3 for illustration)
Z = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# Define a 3x3 permutation matrix P; right-multiplying by P swaps the first and last columns
P = np.array([[0, 0, 1],
              [0, 1, 0],
              [1, 0, 0]])

# Compute sigma(Z)
sigma_Z = sigmoid(Z)

# Compute sigma(ZP)
sigma_ZP = sigmoid(np.dot(Z, P))

# Compute sigma(Z)P
sigma_Z_P = np.dot(sigma_Z, P)

# Check if sigma(ZP) and sigma(Z)P are the same
are_matrices_equal = np.allclose(sigma_Z_P, sigma_ZP)

sigma_Z, sigma_ZP, sigma_Z_P, are_matrices_equal
Let’s generalize the result using mathematical notation.
Consider a matrix \(\mathbf{Z}\) with dimensions \(m \times n\) and a permutation matrix \(\mathbf{P}\) with dimensions \(n \times n\).
Let’s populate \(\mathbf{Z}\) with generic elements \(z_{ij}\):

\[
\mathbf{Z} = \begin{bmatrix} z_{11} & z_{12} & \cdots & z_{1n} \\ z_{21} & z_{22} & \cdots & z_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ z_{m1} & z_{m2} & \cdots & z_{mn} \end{bmatrix}
\]

After applying the element-wise activation function \(\sigma\) (without loss of generality we take it to be the sigmoid), we get:

\[
\sigma(\mathbf{Z}) = \begin{bmatrix} \sigma(z_{11}) & \sigma(z_{12}) & \cdots & \sigma(z_{1n}) \\ \sigma(z_{21}) & \sigma(z_{22}) & \cdots & \sigma(z_{2n}) \\ \vdots & \vdots & \ddots & \vdots \\ \sigma(z_{m1}) & \sigma(z_{m2}) & \cdots & \sigma(z_{mn}) \end{bmatrix}
\]
Now let’s consider the effect of the permutation matrix \(\mathbf{P}\) on \(\sigma(\mathbf{Z})\):
Since \(\mathbf{P}\) is a permutation matrix, \(\sigma(\mathbf{Z}) \mathbf{P}\) will result in a matrix where the columns of \(\sigma(\mathbf{Z})\) are permuted.
Similarly, \(\sigma(\mathbf{Z} \mathbf{P})\) will first permute the columns of \(\mathbf{Z}\) and then apply \(\sigma\) to each element. Because \(\sigma\) is an element-wise function, applying \(\sigma\) before or after the permutation yields the same result:

\[
\sigma(\mathbf{Z} \mathbf{P}) = \sigma(\mathbf{Z}) \mathbf{P}.
\]
In summary, the property holds because the sigmoid function \(\sigma\) is applied element-wise, and the permutation matrix \(\mathbf{P}\) merely rearranges the elements without altering them.
Forward Propagation of the Output Layer#
Output layer activations with permuted weights:

\[\begin{split}
\begin{aligned}
\mathbf{O}_{\text{perm}} &= \mathbf{H}_{\text{perm}} \mathbf{W}^{(2)}_{\text{perm}} + \mathbf{b}^{(2)} \\
&= \left(\mathbf{H} \mathbf{P}\right) \left(\mathbf{P}^T \mathbf{W}^{(2)}\right) + \mathbf{b}^{(2)} \\
&= \mathbf{H} \left(\mathbf{P} \mathbf{P}^T\right) \mathbf{W}^{(2)} + \mathbf{b}^{(2)} \\
&= \mathbf{H} \mathbf{W}^{(2)} + \mathbf{b}^{(2)} \\
&= \mathbf{O} \qquad (1.7)
\end{aligned}
\end{split}\]
(Example: Forward Propagation with Permuted Weights (Output Layer))
Let’s use the permuted weights and biases from the previous example to perform the forward pass. We’ll use the same hidden representation matrix \(\mathbf{H}\) as before. We will not directly apply the result we’ve obtained in Equation 1.7 and instead perform the forward pass from scratch to demonstrate the consistency of the result.
The output layer activations \(\mathbf{O}_{\text{perm}}\) with permuted weights are:
This is expected because we know from Equation 1.7 that \(\mathbf{O}_{\text{perm}} = \mathbf{O}\).
(Why Is There No Need to Permute the Bias in the Output Layer?)
In a one-hidden-layer MLP, both \(\mathbf{W}^{(1)}\) and \(\mathbf{b}^{(1)}\) pertain to the hidden layer, and their dimensions are directly related to the number of hidden units \(h\). Specifically, \(\mathbf{W}^{(1)}\) is of shape \(d \times h\) and \(\mathbf{b}^{(1)}\) is of shape \(h\). When we permute the columns of \(\mathbf{W}^{(1)}\), we’re effectively rearranging the hidden units. To maintain consistency, we must apply the same permutation to \(\mathbf{b}^{(1)}\), which also has dimensions related to \(h\).
On the other hand, \(\mathbf{b}^{(2)}\) is related to the output layer and its dimensions correspond to the number of output units \(q\), not the hidden units \(h\). Since we’re permuting the hidden units when we rearrange the columns of \(\mathbf{W}^{(1)}\) and the rows of \(\mathbf{W}^{(2)}\), there’s no need to apply a permutation to \(\mathbf{b}^{(2)}\) because it’s not directly tied to the hidden units \(h\). In essence, the bias in the output layer \(\mathbf{b}^{(2)}\) doesn’t have a ‘pair’ in the hidden layer to permute in sync with.
In short, \(\mathbf{H}_{\text{perm}}\mathbf{W}^{(2)}_{\text{perm}} = \mathbf{H} \mathbf{W}^{(2)}\), so at this juncture there is no need to permute \(\mathbf{b}^{(2)}\).
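Numerically, with the assumed values from the earlier sketches, the invariance is easy to confirm:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

X = np.array([[1.0, 2.0], [3.0, 4.0]])
W1 = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
b1 = np.array([0.01, 0.02, 0.03])
W2 = np.array([[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]])
b2 = np.array([0.04, 0.05])
P = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])

# Original forward pass
O = sigmoid(X @ W1 + b1) @ W2 + b2

# Forward pass with consistently permuted parameters (b2 untouched)
O_perm = sigmoid(X @ (W1 @ P) + b1 @ P) @ (P.T @ W2) + b2

assert np.allclose(O, O_perm)       # the network computes the same function
```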
Backward Propagation with Permuted Weights#
To demonstrate that the backpropagation gradients are invariant under the permutation symmetry, we’ll consider the loss function \(\mathcal{L}(\cdot)\), which is a function of the output layer activations \(\mathbf{O}\) or \(\mathbf{O}_{\text{perm}}\) in the permuted case. For simplicity, let’s assume the loss function is the Mean Squared Error (MSE) between \(\mathbf{O}\) and some target \(\mathbf{Y}\), defined as:

\[
\mathcal{L}(\mathbf{O}, \mathbf{Y}) = \frac{1}{2} \left\lVert \mathbf{O} - \mathbf{Y} \right\rVert^2.
\]
We know from the forward propagation that \(\mathbf{O}_{\text{perm}} = \mathbf{O}\), so the loss \(\mathcal{L}\) remains invariant under the permutation symmetry. This means that the gradients of the loss with respect to the weights and biases are also invariant under the permutation symmetry (up to the same consistent permutation). Let’s see why this is the case by deriving the gradients of the loss with respect to the weights and biases.
Gradient of Loss with Respect to Output Layer Activations#
The reason we need to find the gradient of the loss with respect to the output layer activations \(\mathbf{O}\) first is due to the chain rule in calculus. The chain rule allows us to break down the gradient of a composite function into simpler parts.
In the context of neural networks, the loss \(\mathcal{L}\) is a function of the output layer activations \(\mathbf{O}\), and the output layer activations \(\mathbf{O}\) are in turn a function of the output layer weights \(\mathbf{W}^{(2)}\). Therefore, the gradient of the loss with respect to \(\mathbf{W}^{(2)}\) can be obtained by chaining these gradients together.
The output \(\mathbf{O}\) is given by Equation \(1.2\):

\[
\mathbf{O} = \mathbf{H} \mathbf{W}^{(2)} + \mathbf{b}^{(2)}.
\]

To find the gradient of the loss with respect to \(\mathbf{W}^{(2)}\), we start by taking the partial derivative of \(\mathcal{L}\) with respect to \(\mathbf{O}\):

\[
\frac{\partial \mathcal{L}}{\partial \mathbf{O}} = \mathbf{O} - \mathbf{Y}.
\]

Next, we need to find \(\frac{\partial \mathbf{O}}{\partial \mathbf{W}^{(2)}}\). Using Equation \(1.2\), we get:

\[
\frac{\partial \mathbf{O}}{\partial \mathbf{W}^{(2)}} = \mathbf{H}.
\]

Finally, we find the gradient of the loss \(\mathcal{L}\) with respect to \(\mathbf{W}^{(2)}\) as:

\[
\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(2)}} = \mathbf{H}^T \left(\mathbf{O} - \mathbf{Y}\right).
\]
Here, \(\mathbf{H}^T (\mathbf{O} - \mathbf{Y})\) follows from the chain rule of calculus, and the transposition on \(\mathbf{H}\) ensures the matrix dimensions are compatible for multiplication. This shows explicitly why the gradient of the loss with respect to \(\mathbf{W}^{(2)}\) is as given.
Gradient of Loss with Respect to Output Layer Weights#
The gradient of the loss with respect to the output layer weights \(\mathbf{W}^{(2)}\) is given by:

\[
\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(2)}} = \mathbf{H}^T \left(\mathbf{O} - \mathbf{Y}\right).
\]

For the permuted weights, the gradient would be:

\[
\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(2)}_{\text{perm}}} = \mathbf{H}_{\text{perm}}^T \left(\mathbf{O}_{\text{perm}} - \mathbf{Y}\right) = \left(\mathbf{H} \mathbf{P}\right)^T \left(\mathbf{O} - \mathbf{Y}\right) = \mathbf{P}^T \mathbf{H}^T \left(\mathbf{O} - \mathbf{Y}\right) = \mathbf{P}^T \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(2)}},
\]

since

\(\mathbf{H}_{\text{perm}} = \mathbf{H} \mathbf{P}\) (from Equation \(1.3\))

\(\mathbf{O}_{\text{perm}} = \mathbf{O}\) (from Equation \(1.7\))
This means the gradient with respect to the output layer weights has also been permuted in a consistent manner.
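A small sketch verifying this relationship with the assumed values from before and an arbitrary target \(\mathbf{Y}\):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

X = np.array([[1.0, 2.0], [3.0, 4.0]])
W1 = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
b1 = np.array([0.01, 0.02, 0.03])
W2 = np.array([[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]])
b2 = np.array([0.04, 0.05])
P = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
Y = np.array([[1.0, 0.0], [0.0, 1.0]])       # arbitrary targets

H = sigmoid(X @ W1 + b1)
O = H @ W2 + b2
grad_W2 = H.T @ (O - Y)                      # dL/dW2 for the original network

H_perm = sigmoid(X @ (W1 @ P) + b1 @ P)      # equals H @ P
O_perm = H_perm @ (P.T @ W2) + b2            # equals O
grad_W2_perm = H_perm.T @ (O_perm - Y)       # dL/dW2_perm

assert np.allclose(grad_W2_perm, P.T @ grad_W2)
```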
Gradient of Loss with Respect to Output Layer Biases#
The loss function is given as the Mean Squared Error (MSE):

\[
\mathcal{L} = \frac{1}{2} \left(\mathbf{O} - \mathbf{Y}\right)^2.
\]

For a single output neuron, the output \(\mathbf{O}\) can be represented as:

\[
\mathbf{O} = \mathbf{H} \mathbf{W}^{(2)} + \mathbf{b}^{(2)}.
\]

Differentiating the loss function \(\mathcal{L}\) with respect to \(\mathbf{b}^{(2)}\) gives:

\[
\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(2)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{O}} \cdot \frac{\partial \mathbf{O}}{\partial \mathbf{b}^{(2)}}.
\]

Here, \(\frac{\partial \mathbf{O}}{\partial \mathbf{b}^{(2)}} = 1\) because the bias term \(\mathbf{b}^{(2)}\) has a coefficient of 1 in the equation for \(\mathbf{O}\).

Also, \(\frac{\partial \mathcal{L}}{\partial \mathbf{O}} = \mathbf{O} - \mathbf{Y}\) because the loss function is \(\frac{1}{2}(\mathbf{O} - \mathbf{Y})^2\).

Putting it all together:

\[
\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(2)}} = \mathbf{O} - \mathbf{Y}.
\]
Recall that there’s no need to permute \(\mathbf{b}^{(2)}\) because it’s not directly tied to the hidden units \(h\). So in our case, we do not need to calculate the gradient of the loss with respect to \(\mathbf{b}^{(2)}_{\text{perm}}\) because we will just use \(\mathbf{b}^{(2)}\) as is.
Permuted Gradients for Model Update#
If we perform one update of the weights using the gradients that respect the permutation symmetry, the output of the permuted forward pass will again be the same as the original. This is because we’ve started with permuted weights that give the same output (\(\mathbf{O}_{\text{perm}} = \mathbf{O}\)), and we’ve updated them using permuted gradients that are functionally equivalent to the original gradients.
To be precise, let’s say we perform an update using a learning rate \(\alpha\) and recall the relationship between the original and permuted weights:

\[
\mathbf{W}^{(1)}_{\text{perm}} = \mathbf{W}^{(1)} \mathbf{P}, \qquad \mathbf{b}^{(1)}_{\text{perm}} = \mathbf{b}^{(1)} \mathbf{P}, \qquad \mathbf{W}^{(2)}_{\text{perm}} = \mathbf{P}^T \mathbf{W}^{(2)}.
\]
Original weights get updated as:

\[\begin{split}
\begin{aligned}
\mathbf{W}^{(1)} &\leftarrow \mathbf{W}^{(1)} - \alpha \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(1)}} \\
\mathbf{b}^{(1)} &\leftarrow \mathbf{b}^{(1)} - \alpha \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(1)}} \\
\mathbf{W}^{(2)} &\leftarrow \mathbf{W}^{(2)} - \alpha \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(2)}} \\
\mathbf{b}^{(2)} &\leftarrow \mathbf{b}^{(2)} - \alpha \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(2)}}
\end{aligned}
\end{split}\]

Permuted weights get updated as:

\[\begin{split}
\begin{aligned}
\mathbf{W}^{(1)}_{\text{perm}} &\leftarrow \mathbf{W}^{(1)}_{\text{perm}} - \alpha \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(1)}_{\text{perm}}} \\
\mathbf{b}^{(1)}_{\text{perm}} &\leftarrow \mathbf{b}^{(1)}_{\text{perm}} - \alpha \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(1)}_{\text{perm}}} \\
\mathbf{W}^{(2)}_{\text{perm}} &\leftarrow \mathbf{W}^{(2)}_{\text{perm}} - \alpha \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(2)}_{\text{perm}}} \\
\mathbf{b}^{(2)} &\leftarrow \mathbf{b}^{(2)} - \alpha \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(2)}}
\end{aligned}
\end{split}\]

Permuted Model Update:

For the first layer, the permuted weight \(\mathbf{W}^{(1)}_{\text{perm}}\) gets updated as:

\[
\mathbf{W}^{(1)}_{\text{perm}} \leftarrow \mathbf{W}^{(1)}_{\text{perm}} - \alpha \left( \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(1)}} \mathbf{P} \right)
\]

For the first layer, the permuted bias \(\mathbf{b}^{(1)}_{\text{perm}}\) gets updated as:

\[
\mathbf{b}^{(1)}_{\text{perm}} \leftarrow \mathbf{b}^{(1)}_{\text{perm}} - \alpha \left( \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(1)}} \mathbf{P} \right)
\]

For the second layer, the permuted weight \(\mathbf{W}^{(2)}_{\text{perm}}\) gets updated as:

\[
\mathbf{W}^{(2)}_{\text{perm}} \leftarrow \mathbf{W}^{(2)}_{\text{perm}} - \alpha \left( \mathbf{P}^T \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(2)}} \right)
\]
To validate that the permuted weights remain consistent with the original weights after updates, we can rewrite \(\mathbf{W}^{(1)}_{\text{perm}}\) and \(\mathbf{W}^{(2)}_{\text{perm}}\) in terms of \(\mathbf{W}^{(1)}\) and \(\mathbf{W}^{(2)}\) respectively:
For updated \(\mathbf{W}^{(1)}_{\text{perm}}\):
\[\begin{split}
\begin{aligned}
\mathbf{W}^{(1)}_{\text{perm}} - \alpha \left( \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(1)}} \mathbf{P} \right) &= \mathbf{W}^{(1)} \mathbf{P} - \alpha \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(1)}} \mathbf{P} \\
&= \left(\mathbf{W}^{(1)} - \alpha \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(1)}}\right) \mathbf{P} \\
&= \mathbf{W}^{(1)}_{\text{new}} \mathbf{P}
\end{aligned}
\end{split}\]

For \(\mathbf{b}^{(1)}_{\text{perm}}\):

\[\begin{split}
\begin{aligned}
\mathbf{b}^{(1)}_{\text{perm}} - \alpha \left( \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(1)}} \mathbf{P} \right) &= \mathbf{b}^{(1)} \mathbf{P} - \alpha \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(1)}} \mathbf{P} \\
&= \left(\mathbf{b}^{(1)} - \alpha \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(1)}}\right) \mathbf{P} \\
&= \mathbf{b}^{(1)}_{\text{new}} \mathbf{P}
\end{aligned}
\end{split}\]

For \(\mathbf{W}^{(2)}_{\text{perm}}\):

\[\begin{split}
\begin{aligned}
\mathbf{W}^{(2)}_{\text{perm}} - \alpha \left( \mathbf{P}^T \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(2)}} \right) &= \mathbf{P}^T \mathbf{W}^{(2)} - \alpha \mathbf{P}^T \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(2)}} \\
&= \mathbf{P}^T \left(\mathbf{W}^{(2)} - \alpha \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(2)}}\right) \\
&= \mathbf{P}^T \mathbf{W}^{(2)}_{\text{new}}
\end{aligned}
\end{split}\]

where \(\mathbf{W}^{(1)}_{\text{new}}\), \(\mathbf{b}^{(1)}_{\text{new}}\), and \(\mathbf{W}^{(2)}_{\text{new}}\) denote the parameters after the original (unpermuted) update.
Thus, the newly updated weights \(\mathbf{W}^{(1)}_{\text{perm}}\) and \(\mathbf{W}^{(2)}_{\text{perm}}\) remain valid permuted versions of the newly updated \(\mathbf{W}^{(1)}\) and \(\mathbf{W}^{(2)}\), respectively, even after the weight updates. Therefore, the output of the permuted forward pass will again be the same as the original one by definition since we have already shown earlier that outputs remain invariant under the permutation symmetry.
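As a sanity check, here is a minimal sketch (reusing the assumed values from the earlier snippets) that performs one manual gradient step on both the original and the permuted parameters and confirms the permutation relationship still holds afterwards:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

X = np.array([[1.0, 2.0], [3.0, 4.0]])
Y = np.array([[1.0, 0.0], [0.0, 1.0]])
W1 = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
b1 = np.array([0.01, 0.02, 0.03])
W2 = np.array([[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]])
b2 = np.array([0.04, 0.05])
P = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
lr = 0.1

def gradients(W1, b1, W2, b2):
    """Manual backprop for the MSE loss L = 0.5 * ||O - Y||^2."""
    H = sigmoid(X @ W1 + b1)
    O = H @ W2 + b2
    dO = O - Y
    dW2 = H.T @ dO
    dH = dO @ W2.T * H * (1 - H)      # back through the sigmoid
    dW1 = X.T @ dH
    db1 = dH.sum(axis=0)
    return dW1, db1, dW2

# One SGD step on the original parameters.
dW1, db1, dW2 = gradients(W1, b1, W2, b2)
W1_new, b1_new, W2_new = W1 - lr * dW1, b1 - lr * db1, W2 - lr * dW2

# One SGD step on the consistently permuted parameters.
dW1p, db1p, dW2p = gradients(W1 @ P, b1 @ P, P.T @ W2, b2)
W1p_new = W1 @ P - lr * dW1p
b1p_new = b1 @ P - lr * db1p
W2p_new = P.T @ W2 - lr * dW2p

# The updated permuted parameters are still permuted versions of the updated originals.
assert np.allclose(W1p_new, W1_new @ P)
assert np.allclose(b1p_new, b1_new @ P)
assert np.allclose(W2p_new, P.T @ W2_new)
```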
Implementation (Connecting Theory to Practice)#
…
How Permutation Symmetry Links to Over-Parametrization#
The concept of unidentifiability or over-parametrization in a statistical model refers to the situation where multiple sets of parameter values produce the same output. This means that the model parameters are not uniquely determined by the data. Permutation symmetry in neural networks is an illustrative example of this phenomenon: multiple configurations of hidden neurons produce the same output, rendering the model over-parametrized.
Implications of Permutation Symmetry#
Non-Unique Minima: In a neural network with permutation symmetry, multiple sets of weights can yield the same minimum loss. These multiple minima are essentially equivalent in terms of their predictive power but different in their parameterization.
Redundancy: The permutation symmetry makes certain neurons or features in the hidden layers redundant, as swapping them doesn’t change the output. This is a form of over-parametrization, where we have more parameters than we actually need to capture the underlying data distribution.
Impacts on Training: When training an over-parametrized model, we could end up with different but equally valid solutions depending on the initial conditions and optimization trajectory. This non-uniqueness can sometimes make it difficult to interpret the learned features or compare different models.
Regularization: Over-parametrized models often require regularization techniques to constrain the space of possible solutions. However, standard regularization techniques may not directly resolve the non-uniqueness issue caused by permutation symmetry.
Statistical Inference: In classical statistics, over-parametrization can lead to issues like multicollinearity, where it becomes difficult to estimate the individual contribution of each parameter. Similarly, in a neural network with permutation symmetry, any interpretation of the importance of individual neurons in a layer may be misleading.
Computational Challenges: The presence of multiple equivalent minima can make the optimization landscape more complicated, affecting the efficiency and stability of learning algorithms.
Model Interpretability: Understanding the impact of each neuron or feature becomes challenging due to this non-uniqueness.
Optimization: Algorithms that are aware of this symmetry can potentially be developed for more efficient optimization.
In summary, permutation symmetry serves as a specific example of how over-parametrization can manifest in machine learning models, particularly neural networks, and brings with it a host of challenges and considerations in both training and interpretation.
On the Importance of Weight Initialization#
Having established that neural networks exhibit permutation symmetry, where the units in a hidden layer are fundamentally interchangeable, we can now delve into a practical issue that arises from this symmetry: the problem of constant weight initialization. Why is this relevant? Because the interchangeable nature of hidden units under permutation symmetry highlights a critical point—these units start as blank slates with equal potential to learn from the data. However, if we initialize all these ‘indistinguishable’ units with constant weights, they will evolve in lockstep during training.
Their gradients will be identical due to the permutation symmetry, ensuring that they continue to remain identical, thereby severely limiting the expressive power of the network. In this way, the concept of permutation symmetry serves as a foundational understanding that makes the dangers of constant weight initialization more intuitive.
This symmetry is not just a theoretical artifact; it has practical implications for the network’s learning dynamics. Imagine initializing all the elements of \(\mathbf{W}^{(1)}\) to a constant \(c\). During forward propagation, the activations in the hidden layer will then be identical due to the uniform weights, leading to identical contributions to the output. During backpropagation, the gradient with respect to \(\mathbf{W}^{(1)}\) will have identical elements. Formally, let \(\nabla \mathbf{W}^{(1)}\) be the gradient; then \(\nabla \mathbf{W}^{(1)} = c' \mathbf{1}\), where \(c'\) is a constant and \(\mathbf{1}\) is a matrix of ones of appropriate dimensions.
As a result of this symmetry, gradient-based optimization methods like minibatch stochastic gradient descent (SGD) will not be able to “break” this symmetry. Every update leaves \(\mathbf{W}^{(1)}\) in a space of constant matrices, effectively causing the network to act as if it had a single hidden unit, thereby limiting its expressive power.
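The claim is easy to reproduce empirically. Below is a minimal PyTorch sketch (the architecture and data are assumed purely for illustration): with a constant initialization, every hidden unit receives an identical gradient, and the hidden units remain identical after training steps.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d, h, q, n = 4, 8, 1, 32
X = torch.randn(n, d)
Y = torch.randn(n, q)

hidden = nn.Linear(d, h)
output = nn.Linear(h, q)

# Constant initialization: every hidden unit starts identical.
nn.init.constant_(hidden.weight, 0.5)
nn.init.constant_(hidden.bias, 0.0)
nn.init.constant_(output.weight, 0.5)
nn.init.constant_(output.bias, 0.0)

optimizer = torch.optim.SGD(list(hidden.parameters()) + list(output.parameters()), lr=0.1)
loss_fn = nn.MSELoss()

for step in range(100):
    optimizer.zero_grad()
    H = torch.sigmoid(hidden(X))
    O = output(H)
    loss = loss_fn(O, Y)
    loss.backward()
    if step == 0:
        # Every hidden unit gets exactly the same gradient row.
        grad = hidden.weight.grad          # shape (h, d) in PyTorch's convention
        print(torch.allclose(grad, grad[0].expand_as(grad)))   # True
    optimizer.step()

# After training, all hidden units are still identical: the symmetry was never broken.
W = hidden.weight.detach()
print(torch.allclose(W, W[0].expand_as(W)))                    # True
```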
Constant Weight Initialization#
…
TMP#
Key#
The key object is the set of hidden units in a hidden layer.

Each hidden unit is a neuron that receives the same input, and ideally each hidden unit should learn a different feature with a different set of weights. But if you initialize all the weights to the same value, every hidden unit computes the same feature with the same weights, so the hidden layer behaves as if it had only one hidden unit. Furthermore, during backpropagation the gradient is the same for every hidden unit, so the weight update is the same for every hidden unit, the hidden units stay identical, and the hidden layer continues to behave as if it had only one hidden unit.

But won’t the values of the \(h\)’s in the next iteration be different from the previous iteration because of the weight update?
Yes, the values of \(h\) will change from one iteration to the next due to the weight update, but the key point is that all \(h\) values will still be the same as each other at any given iteration. In other words, each hidden unit will continue to compute the same function as all the other hidden units, even though that function will evolve as the weights are updated. This is because the gradients for each hidden unit are identical at each step, leading to identical updates for all hidden units. Thus, the symmetry is never broken, and the neural network fails to utilize the full expressiveness of having multiple hidden units.
Notation and Assumptions#
For the sake of simplicity, we will ignore bias and also assume this is a MLP with linear layers stacked.
\(\mathbf{X} \in \mathbb{R}^{n \times d}\): A minibatch of \(n\) examples, each with \(d\) features.
\(\mathbf{W}^{(l)} \in \mathbb{R}^{d^{(l-1)} \times d^{(l)}}\): The weight matrix for layer \(l\), where \(d^{(l-1)}\) and \(d^{(l)}\) are the number of units in layers \(l-1\) and \(l\), respectively.
Each column of \(\mathbf{W}^{(l)}\) represents the weights connecting one unit in layer \(l\) to all units in layer \(l-1\). Specifically, the \(j\)-th column of \(\mathbf{W}^{(l)}\) contains the weights that connect the \(j\)-th unit of layer \(l\) to every unit in layer \(l-1\). This setup allows the linear transformation \(\mathbf{Z}^{(l)} = \mathbf{H}^{(l-1)} \mathbf{W}^{(l)}\) to be efficiently computed.
Each column of \(\mathbf{W}^{(l)}\) essentially constitutes the weight vector for a single neuron (unit) in layer \(l\). These weights are used to connect this neuron to all neurons in the previous layer \(l-1\). So, you can think of the \(j\)-th column of \(\mathbf{W}^{(l)}\) as the set of weights for the \(j\)-th neuron in layer \(l\), where the entry \(\mathbf{W}_{ij}^{(l)}\) is the weight connecting the \(i\)-th neuron in layer \(l-1\) to the \(j\)-th neuron in layer \(l\).
\(\mathbf{H}^{(l-1)} \in \mathbb{R}^{n \times d^{(l-1)}}\): The activations of layer \(l-1\).
\(\mathbf{H}^{(l)} \in \mathbb{R}^{n \times d^{(l)}}\): The activations of layer \(l\).
We let \(l\) start from \(1\) and define the following conventions for the first layer, i.e., when \(l = 1\):
\(d^{(l-1)} = d^{(0)} = d\).
\(\mathbf{H}^{(l-1)} = \mathbf{H}^{(0)} = \mathbf{X} \in \mathbb{R}^{n \times d}\).
\(\mathbf{W}^{(l)} = \mathbf{W}^{(1)} \in \mathbb{R}^{d \times d^{(1)}}\).
\(\mathbf{Z}^{(l)} = \mathbf{H}^{(l-1)} \mathbf{W}^{(l)} \in \mathbb{R}^{n \times d^{(l)}}\): The pre-activation values.
\(\phi(\cdot)\) is the activation function. Assume it is applied element-wise to a vector or matrix and all layers use the same activation function so we do not need to index it by layer.
Note here \(d^{(l)}\) is the number of units (or neurons) in layer \(l\). So if our input is \(d\)-dimensional, then we can transform it to any dimension \(d^{(l)}\) through the use of linear layers.
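A tiny sketch of these shapes for a two-layer stack, with assumed sizes \(n = 4\), \(d = 5\), \(d^{(1)} = 3\), \(d^{(2)} = 2\) (bias omitted, matching the assumptions above):

```python
import numpy as np

rng = np.random.default_rng(0)
phi = np.tanh                      # element-wise activation, same for all layers

n, d, d1, d2 = 4, 5, 3, 2
X = rng.normal(size=(n, d))        # H^(0) = X
W1 = rng.normal(size=(d, d1))      # W^(1): connects layer 0 (d units) to layer 1 (d1 units)
W2 = rng.normal(size=(d1, d2))     # W^(2): connects layer 1 to layer 2

Z1 = X @ W1                        # pre-activations of layer 1, shape (n, d1)
H1 = phi(Z1)                       # activations of layer 1
Z2 = H1 @ W2                       # pre-activations of layer 2, shape (n, d2)
H2 = phi(Z2)

print(Z1.shape, H1.shape, Z2.shape, H2.shape)   # (4, 3) (4, 3) (4, 2) (4, 2)
```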
Assumptions#
All hidden units in a given layer \(l\) have identical weights and biases.
More concretely and formally, it means that for any hidden units \(i\) and \(j\) in layer \(l\), \(\mathbf{W}_{:,i}^{(l)} = \mathbf{W}_{:,j}^{(l)}\), where \(\mathbf{W}_{:,i}^{(l)}\) is the \(i\)-th column of \(\mathbf{W}^{(l)}\).
We use the same activation function \(\phi(\cdot)\) for all hidden units in that layer.
For simplicity, we ignore any regularization term in the loss function.
Mathematical Proof for Identical Activations Producing Identical Gradients#
The intuition is simple: if the activations are identical, then the gradients will be identical as well, because the gradients are computed from the activations.
Consider a simple SGD update rule for a single weight \(W_{ij}^{(l)}\):

\[
W_{ij}^{(l)} \leftarrow W_{ij}^{(l)} - \eta \frac{\partial L}{\partial W_{ij}^{(l)}},
\]
where \(\eta\) is the learning rate and \(L\) is the loss function.
Note that in logistic regression we often write the weight update compactly as \(\mathbf{W} \leftarrow \mathbf{W} - \eta \nabla \mathcal{L}(\mathbf{W})\), where \(\nabla \mathcal{L}(\mathbf{W})\) is the gradient of the loss function with respect to the weight vector. In a neural network the weights of a layer form a matrix, and here we write the update entry-wise to emphasize that each entry \(W_{ij}^{(l)}\) is updated by its own partial derivative.
The quantity that matters is therefore the gradient \(\frac{\partial L}{\partial W_{ij}^{(l)}}\): if two hidden units have identical incoming weights and identical activations, their weights receive identical gradients and hence identical updates. Let’s prove this formally.
References and Further Readings#
…