🧠 Neural Network Architecture

Complete Mathematical & Code Walkthrough for ASD Classification

1. Network Architecture Overview

2-Layer Feedforward Network for Binary Classification

Input Layer → Hidden Layer (ReLU) → Output Layer (Sigmoid) → Prediction

Input layer (4 features: x₁, x₂, x₃, x₄) → [W₁ (4×3), b₁ (1×3), ReLU] → Hidden layer (3 neurons: h₁, h₂, h₃) → [W₂ (3×1), b₂ (1×1), Sigmoid] → Output ŷ → Prediction (0 or 1)

Example Dimensions

  • Input dimension: 4 features
  • Hidden dimension: 3 neurons
  • Output dimension: 1 (binary classification)
  • Batch size: 2 samples (shapes traced in the sketch below)
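As a quick sanity check on these dimensions, here is a minimal sketch (placeholder zero arrays, illustrative names) tracing the shapes through the network:

import numpy as np

# Trace shapes through the network using the dimensions above
X = np.zeros((2, 4))                          # batch of 2 samples, 4 features
W1, b1 = np.zeros((4, 3)), np.zeros((1, 3))
W2, b2 = np.zeros((3, 1)), np.zeros((1, 1))

Z1 = X @ W1 + b1                              # (2, 4) @ (4, 3) -> (2, 3)
Z2 = Z1 @ W2 + b2                             # (2, 3) @ (3, 1) -> (2, 1)
print(Z1.shape, Z2.shape)                     # (2, 3) (2, 1); activations don't change shapes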

2. Initialization

Step 1: Parameter Setup

We need 4 weight/bias matrices to connect all layers:

| Parameter | Shape | Description |
|-----------|--------|--------------------------------------------|
| W₁ | (4, 3) | Weights connecting input → hidden layer |
| b₁ | (1, 3) | Bias for hidden layer |
| W₂ | (3, 1) | Weights connecting hidden → output layer |
| b₂ | (1, 1) | Bias for output layer |

Step 2: He Initialization

W₁ = random_normal(4, 3) × √(2 / input_dim)

Why He Initialization?

He initialization scales weights by √(2/n_in), where n_in = number of input neurons. It is often grouped with Xavier initialization (which uses √(1/n_in) or √(2/(n_in + n_out))); the extra factor of 2 compensates for ReLU zeroing out roughly half of its inputs.

Purpose: Prevents vanishing/exploding gradients by keeping the variance of activations roughly constant across layers.

# He initialization (the sqrt(2/n_in) scaling suits ReLU)
self.W1 = np.random.randn(input_dim, hidden_dim) * np.sqrt(2. / input_dim)
self.b1 = np.zeros((1, hidden_dim))
self.W2 = np.random.randn(hidden_dim, 1) * np.sqrt(2. / hidden_dim)
self.b2 = np.zeros((1, 1))

Step 3: Example Initial Values

W₁ = [[ 0.35, -0.21,  0.48],
      [-0.15,  0.67,  0.12],
      [ 0.44, -0.39,  0.55],
      [-0.28,  0.19, -0.41]]   # Shape: (4, 3)

b₁ = [[0., 0., 0.]]            # Shape: (1, 3)

W₂ = [[ 0.58],
      [-0.42],
      [ 0.31]]                 # Shape: (3, 1)

b₂ = [[0.]]                    # Shape: (1, 1)
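To follow the later computations in code, here is a minimal sketch (standalone variables, not the class attributes) that builds these example parameters:

import numpy as np

# Example parameters copied from the walkthrough above (illustrative, not learned values)
W1 = np.array([[ 0.35, -0.21,  0.48],
               [-0.15,  0.67,  0.12],
               [ 0.44, -0.39,  0.55],
               [-0.28,  0.19, -0.41]])
b1 = np.zeros((1, 3))
W2 = np.array([[0.58], [-0.42], [0.31]])
b2 = np.zeros((1, 1))

assert W1.shape == (4, 3) and W2.shape == (3, 1)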

3. Forward Pass

The forward pass transforms input data through the network layers to produce a prediction.

Step 1: Input Batch

X = [[0.2, 0.5, 0.1, 0.8],   # Sample 1
     [0.3, 0.7, 0.4, 0.6]]   # Sample 2

# Shape: (2, 4)

Step 2: Hidden Layer Pre-Activation

Formula: Z₁ = X · W₁ + b₁

Calculation for first element (Sample 1, Neuron 1):

Z₁[0,0] = 0.2×0.35 + 0.5×(-0.15) + 0.1×0.44 + 0.8×(-0.28) + 0
        = 0.07 - 0.075 + 0.044 - 0.224
        = -0.185

Result:
Z₁ = [[-0.185, 0.321, 0.156],
      [-0.095, 0.487, 0.203]]   # Shape: (2, 3)
self.Z1 = np.dot(X, self.W1) + self.b1

Step 3: Hidden Layer Activation (ReLU)

Formula: A₁ = ReLU(Z₁) = max(0, Z₁)

ReLU Function:
ReLU(x) = x if x > 0, else 0

Result:
A₁ = [[0.000, 0.321, 0.156],
      [0.000, 0.487, 0.203]]   # Shape: (2, 3)
self.A1 = relu(self.Z1) # relu(x) = np.maximum(0, x)
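relu and its derivative are assumed helper functions throughout this walkthrough; a minimal sketch of both:

import numpy as np

def relu(x):
    # Element-wise max(0, x)
    return np.maximum(0, x)

def relu_derivative(x):
    # 1 where x > 0, else 0 (the mask used during backprop)
    return (x > 0).astype(float)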

Step 4: Output Layer Pre-Activation

Formula: Z₂ = A₁ · W₂ + b₂

Calculation for Sample 1:
Z₂[0] = 0.000×0.58 + 0.321×(-0.42) + 0.156×0.31 + 0
      = 0 - 0.135 + 0.048
      = -0.087

Result:
Z₂ = [[-0.087],
      [-0.142]]   # Shape: (2, 1)
self.Z2 = np.dot(self.A1, self.W2) + self.b2

Step 5: Output Activation (Sigmoid)

Formula: A₂ = σ(Z₂) = 1 / (1 + e^(-Z₂))

Computation:

A₂[0] = 1 / (1 + e^0.087) = 1 / (1 + 1.091) = 0.478
A₂[1] = 1 / (1 + e^0.142) = 1 / (1 + 1.153) = 0.465

Result:
A₂ = [[0.478],
      [0.465]]   # Shape: (2, 1)

Interpretation: the model assigns Sample 1 a 47.8% probability of ASD and Sample 2 a 46.5% probability.

self.A2 = sigmoid(self.Z2)
return self.A2
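sigmoid is likewise an assumed helper; a minimal definition:

def sigmoid(x):
    # Logistic function: squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))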

4. Loss Calculation

Binary Cross-Entropy Loss

Formula: L = -(1/m) × Σ [y_i × log(ŷ_i) + (1 - y_i) × log(1 - ŷ_i)]

Where:
m   = number of samples (batch size)
y_i = true label (0 or 1)
ŷ_i = predicted probability

Example Calculation

True labels:
y = [[1],   # Sample 1 has ASD
     [0]]   # Sample 2 does not have ASD

Predicted probabilities:
y_pred = [[0.478],
          [0.465]]

For Sample 1 (y=1):

Loss₁ = -(1 × log(0.478) + 0 × log(1 - 0.478))
      = -log(0.478)
      = 0.738

For Sample 2 (y=0):

Loss₂ = -(0 × log(0.465) + 1 × log(1 - 0.465))
      = -log(0.535)
      = 0.625

Total Loss: L = (0.738 + 0.625) / 2 = 0.682
def binary_cross_entropy(y_true, y_pred):
    eps = 1e-8  # Small value to prevent log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred)
                    + (1 - y_true) * np.log(1 - y_pred))

5. Backward Pass (Backpropagation)

Goal: Gradients via the Chain Rule

We need to find how much each weight contributed to the error, then adjust them accordingly.

Compute gradients: ∂L/∂W₁, ∂L/∂b₁, ∂L/∂W₂, ∂L/∂b₂

Step 1: Output Layer Gradient

Derivative of Loss w.r.t. Z₂:

For binary cross-entropy combined with sigmoid, the derivative simplifies to:
∂L/∂Z₂ = ŷ - y   (predicted - actual)

Mathematical Proof:

∂L/∂Z₂ = (∂L/∂A₂) × (∂A₂/∂Z₂) = [(ŷ - y)/(ŷ(1-ŷ))] × [ŷ(1-ŷ)] = ŷ - y

This beautiful simplification is why sigmoid + cross-entropy work so well together!

Computation:
dZ₂ = ŷ - y
    = [[0.478],   [[1],
       [0.465]] -  [0]]
    = [[-0.522],
       [ 0.465]]   # Shape: (2, 1)
dZ2 = y_pred - y.reshape(-1, 1)
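The simplification can also be checked numerically; a minimal finite-difference sketch (standalone, independent of the class code):

import numpy as np

def bce_of_logit(z, y):
    # Binary cross-entropy written directly in terms of the logit z
    p = 1.0 / (1.0 + np.exp(-z))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

z, y, eps = -0.087, 1.0, 1e-6
numeric = (bce_of_logit(z + eps, y) - bce_of_logit(z - eps, y)) / (2 * eps)
analytic = 1.0 / (1.0 + np.exp(-z)) - y   # sigmoid(z) - y
print(round(numeric, 3), round(analytic, 3))   # both ~ -0.522, matching dZ₂[0] above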

Step 2: Output Layer Weight Gradient

Formula: ∂L/∂W₂ = (1/m) × A₁ᵀ · dZ₂

Chain rule: ∂L/∂W₂ = (∂L/∂Z₂) × (∂Z₂/∂W₂)

Since Z₂ = A₁·W₂ + b₂, we have ∂Z₂/∂W₂ = A₁

Computation:
dW₂ = (1/2) × A₁ᵀ · dZ₂

A₁ᵀ = [[0.000, 0.000],
       [0.321, 0.487],
       [0.156, 0.203]]

dW₂[0] = 0.5 × (0.000×(-0.522) + 0.000×0.465) = 0.000
dW₂[1] = 0.5 × (0.321×(-0.522) + 0.487×0.465) = 0.029
dW₂[2] = 0.5 × (0.156×(-0.522) + 0.203×0.465) = 0.006

Result:
dW₂ = [[0.000],
       [0.029],
       [0.006]]   # Shape: (3, 1)
m = y.shape[0]
dW2 = (1 / m) * np.dot(self.A1.T, dZ2)

Step 3: Output Layer Bias Gradient

Formula: ∂L/∂b₂ = mean(dZ₂)

db₂ = mean([[-0.522],
            [ 0.465]]) = [[-0.029]]   # Shape: (1, 1)
db2 = np.mean(dZ2, axis=0, keepdims=True)

Step 4: Hidden Layer Gradient (Backpropagation)

Formula: ∂L/∂A₁ = dZ₂ · W₂ᵀ

Computation:
dA₁ = [[-0.522],
       [ 0.465]] · [[0.58, -0.42, 0.31]]

    = [[-0.303,  0.219, -0.162],
       [ 0.270, -0.195,  0.144]]   # Shape: (2, 3)
dA1 = np.dot(dZ2, self.W2.T)

Step 5: Apply the ReLU Derivative

Formula: ∂L/∂Z₁ = dZ₁ = dA₁ × ReLU'(Z₁)   (element-wise)

ReLU Derivative:
ReLU'(x) = 1 if x > 0, else 0

Recall Z₁:
Z₁ = [[-0.185, 0.321, 0.156],
      [-0.095, 0.487, 0.203]]

ReLU Derivative Mask:
ReLU'(Z₁) = [[0, 1, 1],
             [0, 1, 1]]   # First column is 0: neuron 1 was negative for both samples

Apply Element-wise:

dZ₁ = dA₁ × ReLU'(Z₁)
    = [[-0.303,  0.219, -0.162],   [[0, 1, 1],
       [ 0.270, -0.195,  0.144]] ×  [0, 1, 1]]

    = [[0.000,  0.219, -0.162],
       [0.000, -0.195,  0.144]]
dZ1 = dA1 * relu_derivative(self.Z1)

Step 6: Hidden Layer Weight Gradient

Formula: ∂L/∂W₁ = (1/m) × Xᵀ · dZ₁

Computation:
dW₁ = (1/2) × [[0.2, 0.3],   · [[0.000,  0.219, -0.162],
               [0.5, 0.7],      [0.000, -0.195,  0.144]]
               [0.1, 0.4],
               [0.8, 0.6]]

Result:
dW₁ = [[ 0.000, -0.007,  0.005],
       [ 0.000, -0.014,  0.010],
       [ 0.000, -0.028,  0.021],
       [ 0.000,  0.029, -0.022]]   # Shape: (4, 3)
dW1 = (1 / m) * np.dot(X.T, dZ1)

Step 7: Hidden Layer Bias Gradient

Formula: ∂L/∂b₁ = mean(dZ₁, axis=0)

db₁ = mean([[0.000,  0.219, -0.162],
            [0.000, -0.195,  0.144]], axis=0)
    = [[0.000, 0.012, -0.009]]   # Shape: (1, 3)
db1 = np.mean(dZ1, axis=0, keepdims=True)

All Gradients Computed!

Now we can update the weights using gradient descent.

6. Optimization (Gradient Descent)

Update Rule

Formula: θ_new = θ_old - learning_rate × gradient

Where θ represents any parameter (W₁, b₁, W₂, b₂)

Intuition

Weights move in the opposite direction of the gradient to reduce loss.

  • If gradient is positive → weight decreases
  • If gradient is negative → weight increases
  • Larger gradient → bigger update

Example Weight Updates

Learning rate: lr = 0.005

Update W₂:

W₂_new = W₂_old - lr × dW₂

       = [[ 0.58],            [[0.000],
          [-0.42],  - 0.005 ×  [0.029],
          [ 0.31]]             [0.006]]

       = [[ 0.58000],   # No change (gradient is zero)
          [-0.42015],   # Decreased by 0.005×0.029
          [ 0.30997]]   # Decreased by 0.005×0.006

Update b₂:

b₂_new = b₂_old - lr × db₂
       = [[0.]] - 0.005 × [[-0.029]]
       = [[0.000145]]

Update W₁:

W₁_new = W₁_old - lr × dW₁

       = [[ 0.35, -0.21,  0.48],            [[ 0.000, -0.007,  0.005],
          [-0.15,  0.67,  0.12],  - 0.005 ×  [ 0.000, -0.014,  0.010],
          [ 0.44, -0.39,  0.55],             [ 0.000, -0.028,  0.021],
          [-0.28,  0.19, -0.41]]             [ 0.000,  0.029, -0.022]]

       = [[ 0.35000, -0.20997,  0.47998],
          [-0.15000,  0.67007,  0.11995],
          [ 0.44000, -0.38986,  0.54990],
          [-0.28000,  0.18986, -0.40989]]
# Gradient descent updates
self.W1 -= self.lr * dW1
self.b1 -= self.lr * db1
self.W2 -= self.lr * dW2
self.b2 -= self.lr * db2

Learning Rate Selection

| Learning Rate | Effect | Result |
|---------------------------|---------------------------|------------------------------------|
| Too high (0.1) | Weights oscillate wildly | Training diverges, loss increases |
| Too low (0.00001) | Tiny weight updates | Training extremely slow |
| Just right (0.001-0.01) | Smooth, steady progress | Optimal convergence |

Advanced: L2 Regularization

Modified Update Rule:
dW = gradient + (λ/m) × W   # Add regularization term
W_new = W_old - lr × dW

Purpose of L2 Regularization

Prevents overfitting by penalizing large weights.

λ (lambda) controls regularization strength (e.g., 0.01)

# With L2 regularization
dW2 = (1 / m) * np.dot(self.A1.T, dZ2) + (self.l2_lambda / m) * self.W2
dW1 = (1 / m) * np.dot(X.T, dZ1) + (self.l2_lambda / m) * self.W1
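The (λ/m)×W gradient terms above correspond to adding a penalty of (λ/2m)×‖W‖² to the loss. If the reported loss should reflect that penalty, a minimal sketch (assuming the same self.l2_lambda attribute):

# Report the L2 penalty as part of the loss; its gradient is the (λ/m)·W term above
l2_penalty = (self.l2_lambda / (2 * m)) * (np.sum(self.W1 ** 2) + np.sum(self.W2 ** 2))
loss = binary_cross_entropy(y_batch, y_pred) + l2_penalty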

7. Training Loop

One Training Epoch

Training Cycle:

1. Shuffle Data (randomize order)

2. For Each Mini-Batch:

  • Forward pass → get predictions
  • Calculate loss
  • Backward pass → compute gradients
  • Update weights

3. Validate on validation set

4. Repeat for multiple epochs

Why Mini-Batch Training?

| Method | Batch Size | Pros | Cons |
|-------------------------|--------------------------|---------------------------------------|---------------------------|
| Batch Gradient Descent | All samples (e.g., 1000) | Smooth convergence, stable gradients | Slow, high memory usage |
| Stochastic GD | 1 sample | Fast updates, low memory | Very noisy, unstable |
| Mini-Batch GD | Small groups (e.g., 16) | Best balance: fast + stable | Requires tuning batch size |

Complete Training Code

def fit(self, X, y, epochs=100, batch_size=16, val_data=None):
    n = X.shape[0]
    history = {'loss': [], 'val_loss': [], 'accuracy': [], 'val_accuracy': []}

    for epoch in range(epochs):
        # 1. Shuffle data
        idx = np.random.permutation(n)
        X_shuffled, y_shuffled = X[idx], y[idx]
        batch_losses, batch_acc = [], []

        # 2. Process each mini-batch
        for i in range(0, n, batch_size):
            X_batch = X_shuffled[i:i + batch_size]
            y_batch = y_shuffled[i:i + batch_size]

            # Forward pass
            y_pred = self.forward(X_batch, training=True)

            # Compute loss
            loss = binary_cross_entropy(y_batch, y_pred)

            # Backward pass
            self.backward(X_batch, y_batch, y_pred)

            # Calculate accuracy
            acc = np.mean((y_pred > 0.5).astype(int).flatten() == y_batch)

            batch_losses.append(loss)
            batch_acc.append(acc)

        # 3. Validation
        val_loss, val_acc = 0, 0
        if val_data:
            X_val, y_val = val_data
            y_val_pred = self.forward(X_val, training=False)
            val_loss = binary_cross_entropy(y_val, y_val_pred)
            val_acc = np.mean((y_val_pred > 0.5).astype(int).flatten() == y_val)

        # 4. Record history
        history['loss'].append(np.mean(batch_losses))
        history['accuracy'].append(np.mean(batch_acc))
        history['val_loss'].append(val_loss)
        history['val_accuracy'].append(val_acc)

        if epoch % 10 == 0:
            print(f"Epoch {epoch+1}/{epochs} - "
                  f"loss: {np.mean(batch_losses):.4f}, "
                  f"val_acc: {val_acc:.2f}")

    return history

Early Stopping

Early Stopping Mechanism

Stop training when validation loss stops improving to prevent overfitting.

# Early stopping logic (initialize these once, before the training loop)
best_val_loss = float('inf')
patience_counter = 0
patience = 20  # Stop after 20 epochs without improvement

# Inside the epoch loop, after computing val_loss:
if val_loss < best_val_loss:
    best_val_loss = val_loss
    patience_counter = 0
    # Save best weights
    best_weights = (self.W1.copy(), self.b1.copy(),
                    self.W2.copy(), self.b2.copy())
else:
    patience_counter += 1
    if patience_counter >= patience:
        print(f"Early stopping at epoch {epoch+1}")
        # Restore best weights
        self.W1, self.b1, self.W2, self.b2 = best_weights
        break

Training Progress Indicators

What to Watch During Training:

  • Training Loss: Should steadily decrease
  • Validation Loss: Should decrease, then plateau
  • Training Accuracy: Should increase
  • Validation Accuracy: Should increase, then stabilize

Warning Signs

  • Training loss increasing: Learning rate too high
  • Val loss >> Train loss: Overfitting (add regularization)
  • Both losses high: Underfitting (increase model capacity)
  • Val accuracy fluctuating wildly: Dataset too small

8. Summary: Complete Information Flow

Forward Pass (Making Predictions)

X → [×W₁ + b₁] → Z₁ → [ReLU] → A₁ → [×W₂ + b₂] → Z₂ → [Sigmoid] → ŷ

Step-by-Step (assembled into a single method after this list):

  1. Linear transformation: Z₁ = X·W₁ + b₁
  2. Activation: A₁ = ReLU(Z₁) = max(0, Z₁)
  3. Linear transformation: Z₂ = A₁·W₂ + b₂
  4. Activation: ŷ = Sigmoid(Z₂) = 1/(1 + e^(-Z₂))
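Assembled from the snippets in Section 3, a minimal forward method might look like this (a sketch; the training flag matches the fit code above but is unused in this bare version):

def forward(self, X, training=True):
    # Layer 1: linear transform + ReLU
    self.Z1 = np.dot(X, self.W1) + self.b1
    self.A1 = relu(self.Z1)
    # Layer 2: linear transform + sigmoid
    self.Z2 = np.dot(self.A1, self.W2) + self.b2
    self.A2 = sigmoid(self.Z2)
    return self.A2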

Backward Pass (Learning from Mistakes)

Loss → dZ₂ = ŷ - y → dW₂, db₂
         ↓ (× W₂ᵀ)
       dA₁ → (× ReLU'(Z₁)) → dZ₁ → dW₁, db₁

Step-by-Step (collected into a single method after this list):

  1. Output gradient: dZ₂ = ŷ - y
  2. Weight gradient: dW₂ = (1/m) × A₁ᵀ·dZ₂
  3. Bias gradient: db₂ = mean(dZ₂)
  4. Backpropagate: dA₁ = dZ₂·W₂ᵀ
  5. Apply derivative: dZ₁ = dA₁ × ReLU'(Z₁)
  6. Weight gradient: dW₁ = (1/m) × Xᵀ·dZ₁
  7. Bias gradient: db₁ = mean(dZ₁)
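Likewise, collecting Sections 5 and 6 into one method gives a minimal backward sketch (assuming a self.lr learning-rate attribute, as in the update code earlier):

def backward(self, X, y, y_pred):
    m = y.shape[0]
    # Output layer gradients
    dZ2 = y_pred - y.reshape(-1, 1)
    dW2 = (1 / m) * np.dot(self.A1.T, dZ2)
    db2 = np.mean(dZ2, axis=0, keepdims=True)
    # Backpropagate into the hidden layer
    dA1 = np.dot(dZ2, self.W2.T)
    dZ1 = dA1 * relu_derivative(self.Z1)
    dW1 = (1 / m) * np.dot(X.T, dZ1)
    db1 = np.mean(dZ1, axis=0, keepdims=True)
    # Gradient descent update
    self.W1 -= self.lr * dW1
    self.b1 -= self.lr * db1
    self.W2 -= self.lr * dW2
    self.b2 -= self.lr * db2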

Weight Update

W_new = W_old - learning_rate × gradient

This cycle of forward pass, loss calculation, backward pass, and weight update repeats for every mini-batch, epoch after epoch, until the loss converges.