1. Network Architecture Overview
2-Layer Feedforward Network for Binary Classification
Input Layer → Hidden Layer (ReLU) → Output Layer (Sigmoid) → Prediction
Example Dimensions
- Input dimension: 4 features
- Hidden dimension: 3 neurons
- Output dimension: 1 (binary classification)
- Batch size: 2 samples
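As a quick sanity check on these shapes, here is a minimal NumPy sketch (the values are placeholders, not the worked example below):

```python
import numpy as np

X = np.zeros((2, 4))                      # batch of 2 samples, 4 features
W1, b1 = np.zeros((4, 3)), np.zeros((1, 3))
W2, b2 = np.zeros((3, 1)), np.zeros((1, 1))

A1 = np.maximum(0, X @ W1 + b1)           # hidden activations: (2, 3)
A2 = 1 / (1 + np.exp(-(A1 @ W2 + b2)))    # predictions: (2, 1)
print(A1.shape, A2.shape)                 # (2, 3) (2, 1)
```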
2. Initialization
Step 1: Parameter Setup
We need four parameter arrays (two weight matrices and two bias vectors) to connect the layers:
| Parameter | Shape | Description |
|-----------|-------|-------------|
| W₁ | (4, 3) | Weights connecting input → hidden layer |
| b₁ | (1, 3) | Bias for hidden layer |
| W₂ | (3, 1) | Weights connecting hidden → output layer |
| b₂ | (1, 1) | Bias for output layer |
Step 2: He Initialization
W₁ = random_normal(4, 3) × √(2 / input_dim)
Why He Initialization?
He initialization scales weights by √(2/n_in), where n_in is the number of input neurons. (The √(2/n_in) factor is the He/Kaiming scheme, a ReLU-adapted variant of Xavier initialization; Xavier proper uses √(1/n_in) or √(2/(n_in + n_out)).)
Purpose: Prevents vanishing/exploding gradients by keeping the variance of activations roughly constant across layers.
# He initialization for the ReLU hidden layer; biases start at zero
self.W1 = np.random.randn(input_dim, hidden_dim) * np.sqrt(2. / input_dim)
self.b1 = np.zeros((1, hidden_dim))
self.W2 = np.random.randn(hidden_dim, 1) * np.sqrt(2. / hidden_dim)
self.b2 = np.zeros((1, 1))
Step 3: Example Initial Values
W₁ = [[ 0.35, -0.21, 0.48],
[-0.15, 0.67, 0.12],
[ 0.44, -0.39, 0.55],
[-0.28, 0.19, -0.41]] # Shape: (4, 3)
b₁ = [[0., 0., 0.]] # Shape: (1, 3)
W₂ = [[ 0.58],
[-0.42],
[ 0.31]] # Shape: (3, 1)
b₂ = [[0.]] # Shape: (1, 1)
3. Forward Pass
The forward pass transforms input data through the network layers to produce a prediction.
Step 1: Input Batch
X = [[0.2, 0.5, 0.1, 0.8],
[0.3, 0.7, 0.4, 0.6]]
Shape: (2, 4)
Step 2: Hidden Layer Pre-Activation
Formula: Z₁ = X · W₁ + b₁
Calculation for first element (Sample 1, Neuron 1):
Z₁[0,0] = 0.2×0.35 + 0.5×(-0.15) + 0.1×0.44 + 0.8×(-0.28) + 0
= 0.07 - 0.075 + 0.044 - 0.224
= -0.185
Result:
Z₁ = [[-0.185, 0.406, -0.117],
      [ 0.008, 0.364,  0.202]] # Shape: (2, 3)
self.Z1 = np.dot(X, self.W1) + self.b1
Step 3: Hidden Layer Activation (ReLU)
Formula: A₁ = ReLU(Z₁) = max(0, Z₁)
ReLU Function:
⎧ x if x > 0
ReLU(x) = ⎨
⎩ 0 if x ≤ 0
Result:
A₁ = [[0.000, 0.406, 0.000],
      [0.008, 0.364, 0.202]] # Shape: (2, 3)
self.A1 = relu(self.Z1)
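The `relu` helper used above is not defined in the snippet; a one-line NumPy version consistent with the formula:

```python
import numpy as np

def relu(z):
    # Element-wise max(0, z)
    return np.maximum(0, z)
```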
Step 4: Output Layer Pre-Activation
Formula: Z₂ = A₁ · W₂ + b₂
Z₂[0] = 0.000×0.58 + 0.406×(-0.42) + 0.000×0.31 + 0
      = 0 - 0.171 + 0
      = -0.171
Result:
Z₂ = [[-0.171],
      [-0.086]] # Shape: (2, 1)
self.Z2 = np.dot(self.A1, self.W2) + self.b2
Step 5: Output Activation (Sigmoid)
Formula: A₂ = σ(Z₂) = 1 / (1 + e^(-Z₂))
Computation:
A₂[0] = 1 / (1 + e^0.171) = 1 / (1 + 1.1865) = 0.457
A₂[1] = 1 / (1 + e^0.086) = 1 / (1 + 1.0898) = 0.479
Result:
A₂ = [[0.457],
      [0.479]] # Shape: (2, 1)
Interpretation: Sample 1 has a 45.7% predicted probability of ASD; Sample 2 has 47.9%.
self.A2 = sigmoid(self.Z2)
return self.A2
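`sigmoid` is likewise assumed by the snippet. Putting the pieces together, a minimal sketch of the full forward method (the `training` flag is kept only for compatibility with the `fit()` code in Section 7; it is unused here):

```python
def sigmoid(z):
    # Logistic function: 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def forward(self, X, training=False):
    # Cache intermediates; the backward pass reuses them
    self.Z1 = np.dot(X, self.W1) + self.b1        # (m, hidden)
    self.A1 = relu(self.Z1)
    self.Z2 = np.dot(self.A1, self.W2) + self.b2  # (m, 1)
    self.A2 = sigmoid(self.Z2)
    return self.A2
```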
4. Loss Calculation
Binary Cross-Entropy Loss
Formula:
L = -1/m × Σ[y_i × log(ŷ_i) + (1 - y_i) × log(1 - ŷ_i)]
Where:
m = number of samples (batch size)
y_i = true label (0 or 1)
ŷ_i = predicted probability
Example Calculation
True labels:
y = [[1],
[0]]
Predicted probabilities:
y_pred = [[0.457],
          [0.479]]
For Sample 1 (y=1):
Loss₁ = -(1 × log(0.457) + 0 × log(1 - 0.457))
      = -log(0.457)
      = -(-0.783)
      = 0.783
For Sample 2 (y=0):
Loss₂ = -(0 × log(0.479) + 1 × log(1 - 0.479))
      = -log(0.521)
      = -(-0.652)
      = 0.652
Total Loss:
L = (0.783 + 0.652) / 2 = 0.718
def binary_cross_entropy(y_true, y_pred):
    # Clip predictions away from 0 and 1 to avoid log(0)
    eps = 1e-8
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
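Plugging in the example batch reproduces the hand computation:

```python
import numpy as np

y_true = np.array([[1.0], [0.0]])
y_pred = np.array([[0.457], [0.479]])
print(binary_cross_entropy(y_true, y_pred))  # ~0.718
```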
5. Backward Pass (Backpropagation)
Goal: Compute the Gradients via the Chain Rule
We need to find how much each weight contributed to the error, then adjust it accordingly. Applying the chain rule, we compute the gradients: ∂L/∂W₁, ∂L/∂b₁, ∂L/∂W₂, ∂L/∂b₂
Step 1: Output Layer Gradient
Derivative of Loss w.r.t. Z₂:
For binary cross-entropy + sigmoid, the derivative simplifies to:
∂L/∂Z₂ = ŷ - y (predicted - actual)
Mathematical Proof:
∂L/∂Z₂ = (∂L/∂A₂) × (∂A₂/∂Z₂)
= [(ŷ - y)/(ŷ(1-ŷ))] × [ŷ(1-ŷ)]
= ŷ - y
This beautiful simplification is why sigmoid + cross-entropy work so well together!
Computation:
dZ₂ = [[0.457],  -  [[1],  =  [[-0.543],
       [0.479]]      [0]]      [ 0.479]]
Shape: (2, 1)
dZ2 = y_pred - y.reshape(-1, 1)
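A quick finite-difference check confirms the simplification (the helper `loss_at` is defined here just for the test):

```python
import numpy as np

def loss_at(z, y):
    # BCE of sigmoid(z) against label y, for a single logit
    p = 1.0 / (1.0 + np.exp(-z))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

z, y, h = -0.171, 1.0, 1e-6
numeric = (loss_at(z + h, y) - loss_at(z - h, y)) / (2 * h)
analytic = 1.0 / (1.0 + np.exp(-z)) - y   # sigmoid(z) - y
print(numeric, analytic)                  # both ~ -0.543
```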
Step 2: Output Layer Weight Gradient
Formula: ∂L/∂W₂ = (1/m) × A₁ᵀ · dZ₂
Chain rule: ∂L/∂W₂ = (∂L/∂Z₂) × (∂Z₂/∂W₂)
Since Z₂ = A₁·W₂ + b₂, we have ∂Z₂/∂W₂ = A₁
Computation (A₁ᵀ has shape (3, 2)):
dW₂ = (1/2) × [[0.000, 0.008],     [[-0.543],
               [0.406, 0.364],  ·   [ 0.479]]
               [0.000, 0.202]]
dW₂[0] = 0.5 × (0.000×(-0.543) + 0.008×0.479) = 0.002
dW₂[1] = 0.5 × (0.406×(-0.543) + 0.364×0.479) = -0.023
dW₂[2] = 0.5 × (0.000×(-0.543) + 0.202×0.479) = 0.048
Result:
dW₂ = [[ 0.002],
       [-0.023],
       [ 0.048]] # Shape: (3, 1)
m = y.shape[0]
dW2 = (1 / m) * np.dot(self.A1.T, dZ2)
Step 3: Output Layer Bias Gradient
Formula: ∂L/∂b₂ = mean(dZ₂)
db₂ = mean([[-0.543], [0.479]]) = [[-0.032]] # Shape: (1, 1)
db2 = np.mean(dZ2, axis=0, keepdims=True)
Step 4: Hidden Layer Gradient (Backpropagate)
Formula: ∂L/∂A₁ = dZ₂ · W₂ᵀ
Computation:
dA₁ = [[-0.543],  ·  [[0.58, -0.42, 0.31]]
       [ 0.479]]
dA₁ = [[-0.315,  0.228, -0.168],
       [ 0.278, -0.201,  0.148]] # Shape: (2, 3)
dA1 = np.dot(dZ2, self.W2.T)
Step 5: Apply ReLU Derivative
Formula: dZ₁ = ∂L/∂Z₁ = dA₁ ⊙ ReLU'(Z₁)  (element-wise product)
ReLU Derivative:
⎧ 1 if x > 0
ReLU'(x) = ⎨
⎩ 0 if x ≤ 0
Recall Z₁:
Z₁ = [[-0.185, 0.406, -0.117],
      [ 0.008, 0.364,  0.202]]
ReLU Derivative Mask:
ReLU'(Z₁) = [[0, 1, 0],
             [1, 1, 1]]
Apply Element-wise:
dZ₁ = dA₁ ⊙ ReLU'(Z₁)
    = [[-0.315,  0.228, -0.168],  ⊙  [[0, 1, 0],
       [ 0.278, -0.201,  0.148]]      [1, 1, 1]]
Result:
    = [[0.000,  0.228, 0.000],
       [0.278, -0.201, 0.148]]
dZ1 = dA1 * relu_derivative(self.Z1)
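`relu_derivative` is assumed by the line above; a matching one-liner:

```python
def relu_derivative(z):
    # 1 where z > 0, else 0 (the subgradient at exactly 0 is taken as 0)
    return (z > 0).astype(float)
```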
Step 6: Hidden Layer Weight Gradient
Formula: ∂L/∂W₁ = (1/m) × Xᵀ · dZ₁
Computation:
dW₁ = (1/2) × [[0.2, 0.3],     [[0.000,  0.228, 0.000],
               [0.5, 0.7],  ·   [0.278, -0.201, 0.148]]
               [0.1, 0.4],
               [0.8, 0.6]]          (Xᵀ has shape (4, 2))
Result:
dW₁ = [[0.042, -0.007, 0.022],
       [0.097, -0.013, 0.052],
       [0.056, -0.029, 0.030],
       [0.083,  0.031, 0.044]] # Shape: (4, 3)
dW1 = (1 / m) * np.dot(X.T, dZ1)
Step 7: Hidden Layer Bias Gradient
Formula: ∂L/∂b₁ = mean(dZ₁, axis=0)
db₁ = mean([[0.000,  0.228, 0.000],
            [0.278, -0.201, 0.148]], axis=0)
    = [[0.139, 0.014, 0.074]] # Shape: (1, 3)
db1 = np.mean(dZ1, axis=0, keepdims=True)
↓
All Gradients Computed!
Now we can update the weights using gradient descent.
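Collecting the steps, a minimal sketch of a `backward` method consistent with the snippets above (the weight update itself is covered in the next section):

```python
def backward(self, X, y, y_pred):
    m = y.shape[0]

    # Output layer: combined sigmoid + BCE gradient
    dZ2 = y_pred - y.reshape(-1, 1)            # (m, 1)
    dW2 = (1 / m) * np.dot(self.A1.T, dZ2)     # (hidden, 1)
    db2 = np.mean(dZ2, axis=0, keepdims=True)

    # Hidden layer: backpropagate through W2, then the ReLU mask
    dA1 = np.dot(dZ2, self.W2.T)               # (m, hidden)
    dZ1 = dA1 * relu_derivative(self.Z1)
    dW1 = (1 / m) * np.dot(X.T, dZ1)           # (input, hidden)
    db1 = np.mean(dZ1, axis=0, keepdims=True)

    return dW1, db1, dW2, db2
```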
6. Optimization (Gradient Descent)
Update Rule
Formula: θ_new = θ_old - learning_rate × gradient
Where θ represents any parameter (W₁, b₁, W₂, b₂)
Intuition
Weights move in the opposite direction of the gradient to reduce loss.
- If gradient is positive → weight decreases
- If gradient is negative → weight increases
- Larger gradient → bigger update
Example Weight Updates
Learning rate: lr = 0.005
Update W₂:
W₂_new = W₂_old - lr × dW₂
       = [[ 0.58],            [[ 0.002],
          [-0.42],  - 0.005 ×  [-0.023],
          [ 0.31]]             [ 0.048]]
       = [[ 0.579990],
          [-0.419885],
          [ 0.309760]]
Update b₂:
b₂_new = b₂_old - lr × db₂
       = [[0.]] - 0.005 × [[-0.032]]
       = [[0.00016]]
Update W₁:
W₁_new = W₁_old - lr × dW₁
       = [[ 0.35, -0.21,  0.48],            [[0.042, -0.007, 0.022],
          [-0.15,  0.67,  0.12],  - 0.005 ×  [0.097, -0.013, 0.052],
          [ 0.44, -0.39,  0.55],             [0.056, -0.029, 0.030],
          [-0.28,  0.19, -0.41]]             [0.083,  0.031, 0.044]]
       = [[ 0.349790, -0.209965,  0.479890],
          [-0.150485,  0.670065,  0.119740],
          [ 0.439720, -0.389855,  0.549850],
          [-0.280415,  0.189845, -0.410220]]
self.W1 -= self.lr * dW1
self.b1 -= self.lr * db1
self.W2 -= self.lr * dW2
self.b2 -= self.lr * db2
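A quick check that this matches the hand-computed W₂ update:

```python
import numpy as np

W2 = np.array([[0.58], [-0.42], [0.31]])
dW2 = np.array([[0.002], [-0.023], [0.048]])
print(W2 - 0.005 * dW2)   # [[0.57999], [-0.419885], [0.30976]]
```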
Learning Rate Selection
| Learning Rate | Effect | Result |
|---------------|--------|--------|
| Too high (0.1) | Weights oscillate wildly | Training diverges, loss increases |
| Too low (0.00001) | Tiny weight updates | Training extremely slow |
| Just right (0.001–0.01) | Smooth, steady progress | Optimal convergence |
Advanced: L2 Regularization
Modified Update Rule:
dW = gradient + (λ/m) × W
W_new = W_old - lr × dW
Purpose of L2 Regularization
Prevents overfitting by penalizing large weights.
λ (lambda) controls regularization strength (e.g., 0.01)
# Add the L2 penalty gradient (λ/m)·W to each weight gradient; biases are typically not regularized
dW2 = (1 / m) * np.dot(self.A1.T, dZ2) + (self.l2_lambda / m) * self.W2
dW1 = (1 / m) * np.dot(X.T, dZ1) + (self.l2_lambda / m) * self.W1
7. Training Loop
One Training Epoch
Training Cycle:
⬇
1. Shuffle Data (randomize order)
⬇
2. For Each Mini-Batch:
- Forward pass → get predictions
- Calculate loss
- Backward pass → compute gradients
- Update weights
⬇
3. Validate on validation set
⬇
4. Repeat for multiple epochs
Why Mini-Batch Training?
| Method | Batch Size | Pros | Cons |
|--------|------------|------|------|
| Batch Gradient Descent | All samples (e.g., 1000) | Smooth convergence, stable gradients | Slow, high memory usage |
| Stochastic GD | 1 sample | Fast updates, low memory | Very noisy, unstable |
| Mini-Batch GD | Small groups (e.g., 16) | Best balance: fast + stable | Requires tuning batch size |
Complete Training Code
def fit(self, X, y, epochs=100, batch_size=16, val_data=None):
    n = X.shape[0]
    history = {'loss': [], 'val_loss': [],
               'accuracy': [], 'val_accuracy': []}
    for epoch in range(epochs):
        # Shuffle once per epoch so mini-batches differ between epochs
        idx = np.random.permutation(n)
        X_shuffled, y_shuffled = X[idx], y[idx]
        batch_losses, batch_acc = [], []
        for i in range(0, n, batch_size):
            X_batch = X_shuffled[i:i + batch_size]
            y_batch = y_shuffled[i:i + batch_size]
            y_pred = self.forward(X_batch, training=True)
            loss = binary_cross_entropy(y_batch, y_pred)
            self.backward(X_batch, y_batch, y_pred)
            # Flatten both sides so shapes (m,) and (m, 1) compare element-wise
            acc = np.mean((y_pred > 0.5).astype(int).flatten()
                          == y_batch.flatten())
            batch_losses.append(loss)
            batch_acc.append(acc)
        val_loss, val_acc = 0, 0
        if val_data is not None:
            X_val, y_val = val_data
            y_val_pred = self.forward(X_val, training=False)
            val_loss = binary_cross_entropy(y_val, y_val_pred)
            val_acc = np.mean((y_val_pred > 0.5).astype(int).flatten()
                              == y_val.flatten())
        history['loss'].append(np.mean(batch_losses))
        history['accuracy'].append(np.mean(batch_acc))
        history['val_loss'].append(val_loss)
        history['val_accuracy'].append(val_acc)
        if epoch % 10 == 0:
            print(f"Epoch {epoch+1}/{epochs} - "
                  f"loss: {np.mean(batch_losses):.4f}, "
                  f"val_acc: {val_acc:.2f}")
    return history
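A hypothetical call, assuming the methods above live on a `TwoLayerNet` class (the class name and constructor signature are illustrative, not from the code above):

```python
model = TwoLayerNet(input_dim=4, hidden_dim=3, lr=0.005)
history = model.fit(X_train, y_train,
                    epochs=100, batch_size=16,
                    val_data=(X_val, y_val))
```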
Early Stopping
Early Stopping Mechanism
Stop training when validation loss stops improving to prevent overfitting.
# Initialize once, before the epoch loop:
best_val_loss = float('inf')
patience_counter = 0
patience = 20

# Inside the epoch loop, after computing val_loss:
if val_loss < best_val_loss:
    best_val_loss = val_loss
    patience_counter = 0
    # Snapshot the best weights so they can be restored later
    best_weights = (self.W1.copy(), self.b1.copy(),
                    self.W2.copy(), self.b2.copy())
else:
    patience_counter += 1
    if patience_counter >= patience:
        print(f"Early stopping at epoch {epoch+1}")
        # Roll back to the best weights seen so far
        self.W1, self.b1, self.W2, self.b2 = best_weights
        break
Training Progress Indicators
What to Watch During Training:
- Training Loss: Should steadily decrease
- Validation Loss: Should decrease, then plateau
- Training Accuracy: Should increase
- Validation Accuracy: Should increase, then stabilize
Warning Signs
- Training loss increasing: Learning rate too high
- Val loss >> Train loss: Overfitting (add regularization)
- Both losses high: Underfitting (increase model capacity)
- Val accuracy fluctuating wildly: Validation set too small
8. Summary: Complete Information Flow
Forward Pass (Making Predictions)
X → [×W₁ + b₁] → Z₁ → [ReLU] → A₁ → [×W₂ + b₂] → Z₂ → [Sigmoid] → ŷ
Step-by-Step:
- Linear transformation: Z₁ = X·W₁ + b₁
- Activation: A₁ = ReLU(Z₁) = max(0, Z₁)
- Linear transformation: Z₂ = A₁·W₂ + b₂
- Activation: ŷ = Sigmoid(Z₂) = 1/(1 + e^(-Z₂))
Backward Pass (Learning from Mistakes)
dZ₂ = ŷ - y
  ├─→ dW₂ = (1/m) A₁ᵀ·dZ₂,  db₂ = mean(dZ₂)
  └─→ dA₁ = dZ₂·W₂ᵀ → dZ₁ = dA₁ ⊙ ReLU'(Z₁)
        ├─→ dW₁ = (1/m) Xᵀ·dZ₁
        └─→ db₁ = mean(dZ₁)
Step-by-Step:
- Output gradient: dZ₂ = ŷ - y
- Weight gradient: dW₂ = (1/m) × A₁ᵀ·dZ₂
- Bias gradient: db₂ = mean(dZ₂)
- Backpropagate: dA₁ = dZ₂·W₂ᵀ
- Apply derivative: dZ₁ = dA₁ ⊙ ReLU'(Z₁)
- Weight gradient: dW₁ = (1/m) × Xᵀ·dZ₁
- Bias gradient: db₁ = mean(dZ₁)
Weight Update
W_new = W_old - learning_rate × gradient
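To tie the walkthrough together, here is a self-contained sketch that assembles every step into one class and runs a single forward/backward pass on the worked example (the class name and layout are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_derivative(z):
    return (z > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TwoLayerNet:
    def __init__(self, input_dim, hidden_dim, lr=0.005, seed=0):
        rng = np.random.default_rng(seed)
        # He initialization (Section 2)
        self.W1 = rng.standard_normal((input_dim, hidden_dim)) * np.sqrt(2. / input_dim)
        self.b1 = np.zeros((1, hidden_dim))
        self.W2 = rng.standard_normal((hidden_dim, 1)) * np.sqrt(2. / hidden_dim)
        self.b2 = np.zeros((1, 1))
        self.lr = lr

    def forward(self, X):
        # Section 3: linear -> ReLU -> linear -> sigmoid
        self.Z1 = np.dot(X, self.W1) + self.b1
        self.A1 = relu(self.Z1)
        self.Z2 = np.dot(self.A1, self.W2) + self.b2
        self.A2 = sigmoid(self.Z2)
        return self.A2

    def backward(self, X, y, y_pred):
        # Section 5: gradients; Section 6: gradient-descent update
        m = y.shape[0]
        dZ2 = y_pred - y.reshape(-1, 1)
        dW2 = (1 / m) * np.dot(self.A1.T, dZ2)
        db2 = np.mean(dZ2, axis=0, keepdims=True)
        dZ1 = np.dot(dZ2, self.W2.T) * relu_derivative(self.Z1)
        dW1 = (1 / m) * np.dot(X.T, dZ1)
        db1 = np.mean(dZ1, axis=0, keepdims=True)
        self.W1 -= self.lr * dW1
        self.b1 -= self.lr * db1
        self.W2 -= self.lr * dW2
        self.b2 -= self.lr * db2

# One training step on the example batch from Section 3
X = np.array([[0.2, 0.5, 0.1, 0.8], [0.3, 0.7, 0.4, 0.6]])
y = np.array([[1.0], [0.0]])
net = TwoLayerNet(input_dim=4, hidden_dim=3)
y_pred = net.forward(X)
net.backward(X, y, y_pred)
```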