1. Network Architecture Overview
2-Layer Feedforward Network for Binary Classification
Input Layer → Hidden Layer (ReLU) → Output Layer (Sigmoid) → Prediction
Example Dimensions
- Input dimension: 4 features
- Hidden dimension: 3 neurons
- Output dimension: 1 (binary classification)
- Batch size: 2 samples
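As a quick sanity check on these shapes, here is a minimal NumPy sketch (the values are placeholders, not the worked example below):

```python
import numpy as np

X = np.zeros((2, 4))                      # batch of 2 samples, 4 features
W1, b1 = np.zeros((4, 3)), np.zeros((1, 3))
W2, b2 = np.zeros((3, 1)), np.zeros((1, 1))

A1 = np.maximum(0, X @ W1 + b1)           # hidden activations: (2, 3)
A2 = 1 / (1 + np.exp(-(A1 @ W2 + b2)))    # predictions: (2, 1)
print(A1.shape, A2.shape)                 # (2, 3) (2, 1)
```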
2. Initialization
Step 1: Parameter Setup
We need four parameter arrays (two weight matrices and two bias vectors) to connect the layers:
| Parameter | Shape | Description |
|-----------|-------|-------------|
| W₁ | (4, 3) | Weights connecting input → hidden layer |
| b₁ | (1, 3) | Bias for hidden layer |
| W₂ | (3, 1) | Weights connecting hidden → output layer |
| b₂ | (1, 1) | Bias for output layer |
Step 2: He Initialization
W₁ = random_normal(4, 3) × √(2 / input_dim)
Why He Initialization?
He initialization scales weights by √(2/n_in), where n_in is the number of input neurons. (The √(2/n_in) factor is the He/Kaiming scheme, a ReLU-adapted variant of Xavier initialization; Xavier proper uses √(1/n_in) or √(2/(n_in + n_out)).)
Purpose: Prevents vanishing/exploding gradients by keeping the variance of activations roughly constant across layers.
# He initialization for the ReLU hidden layer; biases start at zero
self.W1 = np.random.randn(input_dim, hidden_dim) * np.sqrt(2. / input_dim)
self.b1 = np.zeros((1, hidden_dim))
self.W2 = np.random.randn(hidden_dim, 1) * np.sqrt(2. / hidden_dim)
self.b2 = np.zeros((1, 1))
Step 3: Example Initial Values
W₁ = [[ 0.35, -0.21, 0.48],
[-0.15, 0.67, 0.12],
[ 0.44, -0.39, 0.55],
[-0.28, 0.19, -0.41]] # Shape: (4, 3)
b₁ = [[0., 0., 0.]] # Shape: (1, 3)
W₂ = [[ 0.58],
[-0.42],
[ 0.31]] # Shape: (3, 1)
b₂ = [[0.]] # Shape: (1, 1)
3. Forward Pass
The forward pass transforms input data through the network layers to produce a prediction.
Step 1: Input Batch
X = [[0.2, 0.5, 0.1, 0.8],
[0.3, 0.7, 0.4, 0.6]]
Shape: (2, 4)
Step 2: Hidden Layer Pre-Activation
Formula: Z₁ = X · W₁ + b₁
Calculation for first element (Sample 1, Neuron 1):
Z₁[0,0] = 0.2×0.35 + 0.5×(-0.15) + 0.1×0.44 + 0.8×(-0.28) + 0
= 0.07 - 0.075 + 0.044 - 0.224
= -0.185
Result:
Z₁ = [[-0.185, 0.406, -0.117],
      [ 0.008, 0.364,  0.202]] # Shape: (2, 3)
self.Z1 = np.dot(X, self.W1) + self.b1
Step 3: Hidden Layer Activation (ReLU)
Formula: A₁ = ReLU(Z₁) = max(0, Z₁)
ReLU Function:
⎧ x if x > 0
ReLU(x) = ⎨
⎩ 0 if x ≤ 0
Result:
A₁ = [[0.000, 0.406, 0.000],
      [0.008, 0.364, 0.202]] # Shape: (2, 3)
self.A1 = relu(self.Z1)
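The `relu` helper used above is not defined in the snippet; a one-line NumPy version consistent with the formula:

```python
import numpy as np

def relu(z):
    # Element-wise max(0, z)
    return np.maximum(0, z)
```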
Step 4: Output Layer Pre-Activation
Formula: Z₂ = A₁ · W₂ + b₂
Z₂[0] = 0.000×0.58 + 0.406×(-0.42) + 0.000×0.31 + 0
      = 0 - 0.171 + 0
      = -0.171
Result:
Z₂ = [[-0.171],
      [-0.086]] # Shape: (2, 1)
self.Z2 = np.dot(self.A1, self.W2) + self.b2
Step 5: Output Activation (Sigmoid)
Formula: A₂ = σ(Z₂) = 1 / (1 + e^(-Z₂))
Computation:
A₂[0] = 1 / (1 + e^0.171) = 1 / (1 + 1.1865) = 0.457
A₂[1] = 1 / (1 + e^0.086) = 1 / (1 + 1.0898) = 0.479
Result:
A₂ = [[0.457],
      [0.479]] # Shape: (2, 1)
Interpretation: Sample 1 has a 45.7% predicted probability of ASD; Sample 2 has 47.9%.
self.A2 = sigmoid(self.Z2)
return self.A2
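`sigmoid` is likewise assumed by the snippet. Putting the pieces together, a minimal sketch of the full forward method (the `training` flag is kept only for compatibility with the `fit()` code in Section 7; it is unused here):

```python
def sigmoid(z):
    # Logistic function: 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def forward(self, X, training=False):
    # Cache intermediates; the backward pass reuses them
    self.Z1 = np.dot(X, self.W1) + self.b1        # (m, hidden)
    self.A1 = relu(self.Z1)
    self.Z2 = np.dot(self.A1, self.W2) + self.b2  # (m, 1)
    self.A2 = sigmoid(self.Z2)
    return self.A2
```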
4. Loss Calculation
Binary Cross-Entropy Loss
Formula:
L = -1/m × Σ[y_i × log(ŷ_i) + (1 - y_i) × log(1 - ŷ_i)]
Where:
m = number of samples (batch size)
y_i = true label (0 or 1)
ŷ_i = predicted probability
Example Calculation
True labels:
y = [[1],
[0]]
Predicted probabilities:
y_pred = [[0.457],
          [0.479]]
For Sample 1 (y=1):
Loss₁ = -(1 × log(0.457) + 0 × log(1 - 0.457))
      = -log(0.457)
      = -(-0.783)
      = 0.783
For Sample 2 (y=0):
Loss₂ = -(0 × log(0.479) + 1 × log(1 - 0.479))
      = -log(0.521)
      = -(-0.652)
      = 0.652
Total Loss:
L = (0.783 + 0.652) / 2 = 0.718
def binary_cross_entropy(y_true, y_pred):
    # Clip predictions away from 0 and 1 to avoid log(0)
    eps = 1e-8
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
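Plugging in the example batch reproduces the hand computation:

```python
import numpy as np

y_true = np.array([[1.0], [0.0]])
y_pred = np.array([[0.457], [0.479]])
print(binary_cross_entropy(y_true, y_pred))  # ~0.718
```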
5. Backward Pass (Backpropagation)
Goal: Compute the Gradients via the Chain Rule
We need to find how much each weight contributed to the error, then adjust it accordingly. Applying the chain rule, we compute the gradients: ∂L/∂W₁, ∂L/∂b₁, ∂L/∂W₂, ∂L/∂b₂
Step 1: Output Layer Gradient
Derivative of Loss w.r.t. Z₂:
For binary cross-entropy + sigmoid, the derivative simplifies to:
∂L/∂Z₂ = ŷ - y (predicted - actual)
Mathematical Proof:
∂L/∂Z₂ = (∂L/∂A₂) × (∂A₂/∂Z₂)
= [(ŷ - y)/(ŷ(1-ŷ))] × [ŷ(1-ŷ)]
= ŷ - y
This beautiful simplification is why sigmoid + cross-entropy work so well together!
Computation:
dZ₂ = [[0.457],  -  [[1],  =  [[-0.543],
       [0.479]]      [0]]      [ 0.479]]
Shape: (2, 1)
dZ2 = y_pred - y.reshape(-1, 1)
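A quick finite-difference check confirms the simplification (the helper `loss_at` is defined here just for the test):

```python
import numpy as np

def loss_at(z, y):
    # BCE of sigmoid(z) against label y, for a single logit
    p = 1.0 / (1.0 + np.exp(-z))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

z, y, h = -0.171, 1.0, 1e-6
numeric = (loss_at(z + h, y) - loss_at(z - h, y)) / (2 * h)
analytic = 1.0 / (1.0 + np.exp(-z)) - y   # sigmoid(z) - y
print(numeric, analytic)                  # both ~ -0.543
```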
Step 2: Output Layer Weight Gradient
Formula: ∂L/∂W₂ = (1/m) × A₁ᵀ · dZ₂
Chain rule: ∂L/∂W₂ = (∂L/∂Z₂) × (∂Z₂/∂W₂)
Since Z₂ = A₁·W₂ + b₂, we have ∂Z₂/∂W₂ = A₁
Computation (A₁ᵀ has shape (3, 2)):
dW₂ = (1/2) × [[0.000, 0.008],     [[-0.543],
               [0.406, 0.364],  ·   [ 0.479]]
               [0.000, 0.202]]
dW₂[0] = 0.5 × (0.000×(-0.543) + 0.008×0.479) = 0.002
dW₂[1] = 0.5 × (0.406×(-0.543) + 0.364×0.479) = -0.023
dW₂[2] = 0.5 × (0.000×(-0.543) + 0.202×0.479) = 0.048
Result:
dW₂ = [[ 0.002],
       [-0.023],
       [ 0.048]] # Shape: (3, 1)
m = y.shape[0]
dW2 = (1 / m) * np.dot(self.A1.T, dZ2)
Step 3: Output Layer Bias Gradient
Formula: ∂L/∂b₂ = mean(dZ₂)
db₂ = mean([[-0.543], [0.479]]) = [[-0.032]] # Shape: (1, 1)
db2 = np.mean(dZ2, axis=0, keepdims=True)
Step 4: Hidden Layer Gradient (Backpropagate)
Formula: ∂L/∂A₁ = dZ₂ · W₂ᵀ
Computation:
dA₁ = [[-0.543],  ·  [[0.58, -0.42, 0.31]]
       [ 0.479]]
dA₁ = [[-0.315,  0.228, -0.168],
       [ 0.278, -0.201,  0.148]] # Shape: (2, 3)
dA1 = np.dot(dZ2, self.W2.T)
Step 5: Apply ReLU Derivative
Formula: dZ₁ = ∂L/∂Z₁ = dA₁ ⊙ ReLU'(Z₁)  (element-wise product)
ReLU Derivative:
⎧ 1 if x > 0
ReLU'(x) = ⎨
⎩ 0 if x ≤ 0
Recall Z₁:
Z₁ = [[-0.185, 0.406, -0.117],
      [ 0.008, 0.364,  0.202]]
ReLU Derivative Mask:
ReLU'(Z₁) = [[0, 1, 0],
             [1, 1, 1]]
Apply Element-wise:
dZ₁ = dA₁ ⊙ ReLU'(Z₁)
    = [[-0.315,  0.228, -0.168],  ⊙  [[0, 1, 0],
       [ 0.278, -0.201,  0.148]]      [1, 1, 1]]
Result:
    = [[0.000,  0.228, 0.000],
       [0.278, -0.201, 0.148]]
dZ1 = dA1 * relu_derivative(self.Z1)
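`relu_derivative` is assumed by the line above; a matching one-liner:

```python
def relu_derivative(z):
    # 1 where z > 0, else 0 (the subgradient at exactly 0 is taken as 0)
    return (z > 0).astype(float)
```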
Step 6: Hidden Layer Weight Gradient
Formula: ∂L/∂W₁ = (1/m) × Xᵀ · dZ₁
Computation:
dW₁ = (1/2) × [[0.2, 0.3],     [[0.000,  0.228, 0.000],
               [0.5, 0.7],  ·   [0.278, -0.201, 0.148]]
               [0.1, 0.4],
               [0.8, 0.6]]          (Xᵀ has shape (4, 2))
Result:
dW₁ = [[0.042, -0.007, 0.022],
       [0.097, -0.013, 0.052],
       [0.056, -0.029, 0.030],
       [0.083,  0.031, 0.044]] # Shape: (4, 3)
dW1 = (1 / m) * np.dot(X.T, dZ1)
Step 7: Hidden Layer Bias Gradient
Formula: ∂L/∂b₁ = mean(dZ₁, axis=0)
db₁ = mean([[0.000,  0.228, 0.000],
            [0.278, -0.201, 0.148]], axis=0)
    = [[0.139, 0.014, 0.074]] # Shape: (1, 3)
db1 = np.mean(dZ1, axis=0, keepdims=True)
↓
All Gradients Computed!
Now we can update the weights using gradient descent.
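Collecting the steps, a minimal sketch of a `backward` method consistent with the snippets above (the weight update itself is covered in the next section):

```python
def backward(self, X, y, y_pred):
    m = y.shape[0]

    # Output layer: combined sigmoid + BCE gradient
    dZ2 = y_pred - y.reshape(-1, 1)            # (m, 1)
    dW2 = (1 / m) * np.dot(self.A1.T, dZ2)     # (hidden, 1)
    db2 = np.mean(dZ2, axis=0, keepdims=True)

    # Hidden layer: backpropagate through W2, then the ReLU mask
    dA1 = np.dot(dZ2, self.W2.T)               # (m, hidden)
    dZ1 = dA1 * relu_derivative(self.Z1)
    dW1 = (1 / m) * np.dot(X.T, dZ1)           # (input, hidden)
    db1 = np.mean(dZ1, axis=0, keepdims=True)

    return dW1, db1, dW2, db2
```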
6. Optimization (Gradient Descent)
Update Rule
Formula: θ_new = θ_old - learning_rate × gradient
Where θ represents any parameter (W₁, b₁, W₂, b₂)
Intuition
Weights move in the opposite direction of the gradient to reduce loss.
- If gradient is positive → weight decreases
- If gradient is negative → weight increases
- Larger gradient → bigger update
Example Weight Updates
Learning rate: lr = 0.005
Update W₂:
W₂_new = W₂_old - lr × dW₂
       = [[ 0.58],            [[ 0.002],
          [-0.42],  - 0.005 ×  [-0.023],
          [ 0.31]]             [ 0.048]]
       = [[ 0.579990],
          [-0.419885],
          [ 0.309760]]
Update b₂:
b₂_new = b₂_old - lr × db₂
       = [[0.]] - 0.005 × [[-0.032]]
       = [[0.00016]]
Update W₁:
W₁_new = W₁_old - lr × dW₁
       = [[ 0.35, -0.21,  0.48],            [[0.042, -0.007, 0.022],
          [-0.15,  0.67,  0.12],  - 0.005 ×  [0.097, -0.013, 0.052],
          [ 0.44, -0.39,  0.55],             [0.056, -0.029, 0.030],
          [-0.28,  0.19, -0.41]]             [0.083,  0.031, 0.044]]
       = [[ 0.349790, -0.209965,  0.479890],
          [-0.150485,  0.670065,  0.119740],
          [ 0.439720, -0.389855,  0.549850],
          [-0.280415,  0.189845, -0.410220]]
self.W1 -= self.lr * dW1
self.b1 -= self.lr * db1
self.W2 -= self.lr * dW2
self.b2 -= self.lr * db2
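A quick check that this matches the hand-computed W₂ update:

```python
import numpy as np

W2 = np.array([[0.58], [-0.42], [0.31]])
dW2 = np.array([[0.002], [-0.023], [0.048]])
print(W2 - 0.005 * dW2)   # [[0.57999], [-0.419885], [0.30976]]
```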
Learning Rate Selection
| Learning Rate | Effect | Result |
|---------------|--------|--------|
| Too high (0.1) | Weights oscillate wildly | Training diverges, loss increases |
| Too low (0.00001) | Tiny weight updates | Training extremely slow |
| Just right (0.001–0.01) | Smooth, steady progress | Optimal convergence |
Advanced: L2 Regularization
Modified Update Rule:
dW = gradient + (λ/m) × W
W_new = W_old - lr × dW
Purpose of L2 Regularization
Prevents overfitting by penalizing large weights.
λ (lambda) controls regularization strength (e.g., 0.01)
# Add the L2 penalty gradient (λ/m)·W to each weight gradient; biases are typically not regularized
dW2 = (1 / m) * np.dot(self.A1.T, dZ2) + (self.l2_lambda / m) * self.W2
dW1 = (1 / m) * np.dot(X.T, dZ1) + (self.l2_lambda / m) * self.W1
7. Training Loop
One Training Epoch
Training Cycle:
⬇
1. Shuffle Data (randomize order)
⬇
2. For Each Mini-Batch:
- Forward pass → get predictions
- Calculate loss
- Backward pass → compute gradients
- Update weights
⬇
3. Validate on validation set
⬇
4. Repeat for multiple epochs
Why Mini-Batch Training?
| Method | Batch Size | Pros | Cons |
|--------|------------|------|------|
| Batch Gradient Descent | All samples (e.g., 1000) | Smooth convergence, stable gradients | Slow, high memory usage |
| Stochastic GD | 1 sample | Fast updates, low memory | Very noisy, unstable |
| Mini-Batch GD | Small groups (e.g., 16) | Best balance: fast + stable | Requires tuning batch size |
Complete Training Code
def fit(self, X, y, epochs=100, batch_size=16, val_data=None):
    n = X.shape[0]
    history = {'loss': [], 'val_loss': [],
               'accuracy': [], 'val_accuracy': []}
    for epoch in range(epochs):
        # Shuffle once per epoch so mini-batches differ between epochs
        idx = np.random.permutation(n)
        X_shuffled, y_shuffled = X[idx], y[idx]
        batch_losses, batch_acc = [], []
        for i in range(0, n, batch_size):
            X_batch = X_shuffled[i:i + batch_size]
            y_batch = y_shuffled[i:i + batch_size]
            y_pred = self.forward(X_batch, training=True)
            loss = binary_cross_entropy(y_batch, y_pred)
            self.backward(X_batch, y_batch, y_pred)
            # Flatten both sides so shapes (m,) and (m, 1) compare element-wise
            acc = np.mean((y_pred > 0.5).astype(int).flatten()
                          == y_batch.flatten())
            batch_losses.append(loss)
            batch_acc.append(acc)
        val_loss, val_acc = 0, 0
        if val_data is not None:
            X_val, y_val = val_data
            y_val_pred = self.forward(X_val, training=False)
            val_loss = binary_cross_entropy(y_val, y_val_pred)
            val_acc = np.mean((y_val_pred > 0.5).astype(int).flatten()
                              == y_val.flatten())
        history['loss'].append(np.mean(batch_losses))
        history['accuracy'].append(np.mean(batch_acc))
        history['val_loss'].append(val_loss)
        history['val_accuracy'].append(val_acc)
        if epoch % 10 == 0:
            print(f"Epoch {epoch+1}/{epochs} - "
                  f"loss: {np.mean(batch_losses):.4f}, "
                  f"val_acc: {val_acc:.2f}")
    return history
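A hypothetical call, assuming the methods above live on a `TwoLayerNet` class (the class name and constructor signature are illustrative, not from the code above):

```python
model = TwoLayerNet(input_dim=4, hidden_dim=3, lr=0.005)
history = model.fit(X_train, y_train,
                    epochs=100, batch_size=16,
                    val_data=(X_val, y_val))
```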
Early Stopping
Early Stopping Mechanism
Stop training when validation loss stops improving to prevent overfitting.
# Initialize once, before the epoch loop:
best_val_loss = float('inf')
patience_counter = 0
patience = 20

# Inside the epoch loop, after computing val_loss:
if val_loss < best_val_loss:
    best_val_loss = val_loss
    patience_counter = 0
    # Snapshot the best weights so they can be restored later
    best_weights = (self.W1.copy(), self.b1.copy(),
                    self.W2.copy(), self.b2.copy())
else:
    patience_counter += 1
    if patience_counter >= patience:
        print(f"Early stopping at epoch {epoch+1}")
        # Roll back to the best weights seen so far
        self.W1, self.b1, self.W2, self.b2 = best_weights
        break
Training Progress Indicators
What to Watch During Training:
- Training Loss: Should steadily decrease
- Validation Loss: Should decrease, then plateau
- Training Accuracy: Should increase
- Validation Accuracy: Should increase, then stabilize
Warning Signs
- Training loss increasing: Learning rate too high
- Val loss >> Train loss: Overfitting (add regularization)
- Both losses high: Underfitting (increase model capacity)
- Val accuracy fluctuating wildly: Validation set too small
8. Summary: Complete Information Flow
Forward Pass (Making Predictions)
X → [×W₁ + b₁] → Z₁ → [ReLU] → A₁ → [×W₂ + b₂] → Z₂ → [Sigmoid] → ŷ
Step-by-Step:
- Linear transformation: Z₁ = X·W₁ + b₁
- Activation: A₁ = ReLU(Z₁) = max(0, Z₁)
- Linear transformation: Z₂ = A₁·W₂ + b₂
- Activation: ŷ = Sigmoid(Z₂) = 1/(1 + e^(-Z₂))
Backward Pass (Learning from Mistakes)
dZ₂ = ŷ - y
  ├─→ dW₂ = (1/m) A₁ᵀ·dZ₂,  db₂ = mean(dZ₂)
  └─→ dA₁ = dZ₂·W₂ᵀ → dZ₁ = dA₁ ⊙ ReLU'(Z₁)
        ├─→ dW₁ = (1/m) Xᵀ·dZ₁
        └─→ db₁ = mean(dZ₁)
Step-by-Step:
- Output gradient: dZ₂ = ŷ - y
- Weight gradient: dW₂ = (1/m) × A₁ᵀ·dZ₂
- Bias gradient: db₂ = mean(dZ₂)
- Backpropagate: dA₁ = dZ₂·W₂ᵀ
- Apply derivative: dZ₁ = dA₁ ⊙ ReLU'(Z₁)
- Weight gradient: dW₁ = (1/m) × Xᵀ·dZ₁
- Bias gradient: db₁ = mean(dZ₁)
Weight Update
W_new = W_old - learning_rate × gradient
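To tie the walkthrough together, here is a self-contained sketch that assembles every step into one class and runs a single forward/backward pass on the worked example (the class name and layout are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_derivative(z):
    return (z > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TwoLayerNet:
    def __init__(self, input_dim, hidden_dim, lr=0.005, seed=0):
        rng = np.random.default_rng(seed)
        # He initialization (Section 2)
        self.W1 = rng.standard_normal((input_dim, hidden_dim)) * np.sqrt(2. / input_dim)
        self.b1 = np.zeros((1, hidden_dim))
        self.W2 = rng.standard_normal((hidden_dim, 1)) * np.sqrt(2. / hidden_dim)
        self.b2 = np.zeros((1, 1))
        self.lr = lr

    def forward(self, X):
        # Section 3: linear -> ReLU -> linear -> sigmoid
        self.Z1 = np.dot(X, self.W1) + self.b1
        self.A1 = relu(self.Z1)
        self.Z2 = np.dot(self.A1, self.W2) + self.b2
        self.A2 = sigmoid(self.Z2)
        return self.A2

    def backward(self, X, y, y_pred):
        # Section 5: gradients; Section 6: gradient-descent update
        m = y.shape[0]
        dZ2 = y_pred - y.reshape(-1, 1)
        dW2 = (1 / m) * np.dot(self.A1.T, dZ2)
        db2 = np.mean(dZ2, axis=0, keepdims=True)
        dZ1 = np.dot(dZ2, self.W2.T) * relu_derivative(self.Z1)
        dW1 = (1 / m) * np.dot(X.T, dZ1)
        db1 = np.mean(dZ1, axis=0, keepdims=True)
        self.W1 -= self.lr * dW1
        self.b1 -= self.lr * db1
        self.W2 -= self.lr * dW2
        self.b2 -= self.lr * db2

# One training step on the example batch from Section 3
X = np.array([[0.2, 0.5, 0.1, 0.8], [0.3, 0.7, 0.4, 0.6]])
y = np.array([[1.0], [0.0]])
net = TwoLayerNet(input_dim=4, hidden_dim=3)
y_pred = net.forward(X)
net.backward(X, y, y_pred)
```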