📊 T-SMOTE Technical Manual

Temporal-SMOTE for Imbalanced Time-Series Classification

A Complete Mathematical, Practical, and Implementation Guide

Based on: "T-SMOTE: Temporal-oriented Synthetic Minority Oversampling Technique for Imbalanced Time Series Classification"
Authors: Microsoft Research | Published: IJCAI 2022

📑 Table of Contents

🎯 1. Introduction & Motivation

T-SMOTE (Temporal-SMOTE) is a specialized oversampling technique designed to address class imbalance in time-series classification problems while preserving temporal dependencies and dynamics.

1.1 The Class Imbalance Problem

In many real-world applications, datasets exhibit severe class imbalance where the minority class (positive cases) is significantly underrepresented compared to the majority class (negative cases). This imbalance causes several critical issues:

  • Model Bias: Classifiers tend to predict the majority class to maximize overall accuracy
  • Poor Minority Detection: The minority class, often the one we care about most (e.g., disease, fraud, failure), gets ignored
  • Evaluation Pitfalls: High accuracy can be misleading when 95% of data belongs to one class
  • Business Impact: Missing rare events can have severe consequences (missed diagnoses, undetected fraud, equipment failures)

🤔 Why Do We Need T-SMOTE?

The Real-World Scenarios:

Medical Diagnosis

In gait analysis for autism spectrum disorder (ASD) detection, you might have 1000 normal gait sequences but only 50 ASD cases. Without proper handling, your LSTM model will simply learn to classify everything as "normal" and achieve 95% accuracy—completely missing the point.

Equipment Failure Prediction

Industrial sensors record thousands of hours of normal operation but only a few failure events. Predicting these rare failures is crucial for maintenance scheduling and preventing costly downtime.

Financial Fraud Detection

Among millions of legitimate transactions, fraudulent ones are rare but extremely costly. The temporal pattern of how fraud develops is key to detection.

Why Traditional Methods Fail

Standard oversampling techniques like SMOTE treat time series as static feature vectors, destroying the temporal order that contains critical information about how patterns evolve and transition between classes.

1.2 Core Innovation of T-SMOTE

T-SMOTE introduces three fundamental innovations:

  1. Temporal Awareness: Treats time as a first-class dimension, not just another feature
  2. Progressive Subsequencing: Generates samples at different temporal positions using "leading times"
  3. Confidence-Guided Synthesis: Uses model predictions to guide interpolation, ensuring synthetic samples are realistic and useful
Core Principle: Instead of randomly mixing samples in feature space, T-SMOTE slides through time, creating synthetic sequences that represent earlier stages of pattern development. This allows models to learn transitional patterns—the gradual shift from normal to abnormal behavior.

1.3 Key Terminology

Time Series
A sequence of observations ordered in time, where each observation is a vector of features measured at a specific time point.
Class Imbalance
A situation where one class (minority) has significantly fewer samples than the other class (majority), typically with a ratio more extreme than 1:10.
Oversampling
A technique to balance class distribution by generating synthetic samples for the minority class.
Temporal Dependency
The relationship between observations at different time points, where the current state depends on previous states.
Decision Boundary
The hyperplane or surface that separates different classes in feature space. Samples near this boundary are hardest to classify.

1.4 When to Use T-SMOTE

T-SMOTE is particularly effective when:

✅ Ideal Use Cases

  • Your data is sequential (time series, sensor data, behavioral sequences)
  • You have severe class imbalance (minority class < 20% of total)
  • The temporal evolution of patterns is important (not just final state)
  • You have a pretrained classifier that can provide confidence scores
  • Transitional patterns matter (how normal becomes abnormal)

⚠️ Not Recommended When

  • Your data is static/tabular without temporal order → use standard SMOTE
  • Classes are already balanced → no oversampling needed
  • You have very short sequences (e.g., <10 time steps) → limited room for subsequencing
  • Temporal order doesn't matter to the classification task

⚠️ 2. The Problem with Standard SMOTE

2.1 How Standard SMOTE Works

SMOTE (Synthetic Minority Over-sampling Technique), introduced by Chawla et al. in 2002, is a foundational technique for handling imbalanced data in traditional machine learning.

Mathematical Formulation

For static feature vectors:

X = [x₁, x₂, ..., xₙ] ∈ ℝⁿ, y ∈ {0,1}

SMOTE generates synthetic samples by linear interpolation:

X_new = X_A + λ(X_B - X_A)

where:
  • X_A: a minority sample
  • X_B: one of its k-nearest neighbors (also minority)
  • λ: random value ∈ [0, 1]

Step-by-Step Process

Standard SMOTE Algorithm

  1. Select a minority class sample X_A
  2. Find its k nearest neighbors (typically k=5) in feature space
  3. Randomly choose one neighbor X_B
  4. Generate random λ ∈ [0,1]
  5. Create synthetic sample: X_new = X_A + λ(X_B - X_A)
  6. Repeat until the desired class balance is achieved (a minimal sketch of these steps follows)
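A minimal NumPy-only sketch of these six steps, assuming a 2-D array X_min of minority-class feature vectors; the helper name smote_oversample and the brute-force neighbor search are illustrative, not a reference implementation:

import numpy as np

def smote_oversample(X_min, n_new, k=5, rng=np.random.default_rng(0)):
    # X_min: (n_minority, n_features) array of minority-class samples
    synthetic = []
    for _ in range(n_new):
        a = rng.integers(len(X_min))                       # step 1: pick a minority sample
        dists = np.linalg.norm(X_min - X_min[a], axis=1)   # step 2: distances to all minority samples
        neighbors = np.argsort(dists)[1:k + 1]             #         its k nearest neighbors (skip itself)
        b = rng.choice(neighbors)                          # step 3: choose one neighbor
        lam = rng.random()                                 # step 4: λ ∈ [0, 1]
        synthetic.append(X_min[a] + lam * (X_min[b] - X_min[a]))  # step 5: interpolate
    return np.asarray(synthetic)                           # step 6: repeated n_new times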

📊 Concrete Example: Credit Scoring

Sample A (defaulter):

Age Income Credit Score
35 45000 580

Sample B (defaulter, nearest neighbor):

Age Income Credit Score
40 50000 600

With λ = 0.6:

New Sample = X_A + 0.6(X_B - X_A) = 0.4 × A + 0.6 × B
Age = 0.4(35) + 0.6(40) = 38
Income = 0.4(45000) + 0.6(50000) = 48000
Credit Score = 0.4(580) + 0.6(600) = 592

This works perfectly because these features are independent and static.
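A quick NumPy check of this arithmetic (feature order [age, income, credit score]; values copied from the tables above):

import numpy as np

A = np.array([35, 45000, 580])   # minority sample A
B = np.array([40, 50000, 600])   # its nearest minority neighbor B
lam = 0.6
X_new = A + lam * (B - A)        # SMOTE interpolation
print(X_new)                     # [   38. 48000.   592.]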

2.2 Why SMOTE Fails for Time Series

🚫 Critical Failure Modes

Problem 1: Temporal Order Destruction

When you flatten a time series into a feature vector, you lose the sequential structure:

Original:  [frame₁, frame₂, frame₃, frame₄, frame₅]
Flattened: [f₁₁, f₁₂, f₁₃, f₂₁, f₂₂, f₂₃, ..., f₅₃]

SMOTE then treats f₁₁ (feature 1 at time 1) and f₃₂ (feature 2 at time 3) as if they're interchangeable—completely ignoring that they occur at different times.

Problem 2: Unrealistic Temporal Mixing

SMOTE might interpolate between samples at completely different temporal phases:

  • Mixing the beginning of one gait cycle with the end of another
  • Combining early-stage failure indicators with late-stage indicators
  • Blending different phases of a heartbeat cycle

Result: Physically impossible synthetic sequences

Problem 3: Loss of Dynamics

Time series contain information in their dynamics—velocity, acceleration, trends. SMOTE interpolation destroys these:

  • Smooth trends become jagged
  • Periodic patterns get distorted
  • Temporal correlations are broken

🎭 Illustrative Example: Gait Analysis

Sequence A (ASD gait): Complete gait cycle from heel strike to toe-off

Phases: Heel Strike → Loading → Mid-stance → Push-off → Swing

Sequence B (ASD gait): Similar but different timing

Phases: Loading → Mid-stance → Push-off → Swing → Heel Strike

What SMOTE produces: Random mixing of phases

Synthetic (frame-by-frame mix of misaligned phases):
  • 0.6×Heel Strike + 0.4×Loading
  • 0.6×Loading + 0.4×Mid-stance
  • 0.6×Mid-stance + 0.4×Push-off
  • 0.6×Push-off + 0.4×Swing
  • 0.6×Swing + 0.4×Heel Strike

This creates biomechanically impossible movement patterns!

2.3 Comparison: What Works vs. What Doesn't

✅ SMOTE Works Great For:

  • Tabular data: Customer demographics, financial ratios
  • Image features: Pixel values, color histograms
  • Static measurements: Lab test results, survey responses
  • Independent features: Where feature order doesn't matter

Why? These features don't have temporal dependencies

❌ SMOTE Fails For:

  • Time series: Sensor readings, physiological signals
  • Sequential data: Video frames, speech signals
  • Behavioral sequences: User actions, transaction patterns
  • Temporal patterns: Where order and dynamics are crucial

Why? Temporal dependencies get destroyed

3. Understanding Time-Series Structure

3.1 Mathematical Representation

A time series is fundamentally different from static data because it contains ordered observations over time.

Formal Definition

Xᵢ = [x¹ᵢ, x²ᵢ, x³ᵢ, ..., xᵀᵢ]

where:
  • i: sample index
  • T: total number of time steps (sequence length)
  • xᵗᵢ ∈ ℝᵈ: feature vector at time t
  • d: number of features (dimensions)

📝 Understanding the Notation

Subscript i: Identifies which time series (e.g., patient #5, sensor #12)

Superscript t: Identifies the time step within that series

Example: x³₅ means "features at time step 3 of time series #5"

3.2 Anatomy of Time-Series Data

🚶 Concrete Example: Gait Analysis Data

Setup: Motion capture of a person walking for 2 seconds at 30 FPS

  • T = 60 time steps (frames)
  • d = 12 features per frame (4 joints × 3 coordinates each)
  • Features: [hip_x, hip_y, hip_z, knee_x, knee_y, knee_z, ankle_x, ankle_y, ankle_z, foot_x, foot_y, foot_z]

Data structure:

Time hip_x hip_y hip_z ... foot_z
t=1 0.45 0.92 0.15 ... 0.02
t=2 0.46 0.93 0.16 ... 0.03
... ... ... ... ... ...
t=60 0.52 0.89 0.18 ... 0.01

Shape: (60, 12) — a matrix where each row is one time step
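As a minimal sketch, this is how such a recording might be held in a NumPy array (random values stand in for real motion-capture data):

import numpy as np

rng = np.random.default_rng(0)
gait = rng.random((60, 12))     # T=60 frames, d=12 features per frame
frame_10 = gait[9]              # feature vector at time step 10, shape (12,)
hip_x = gait[:, 0]              # one feature tracked across all 60 frames, shape (60,)
print(gait.shape)               # (60, 12)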

3.3 What Makes Time Series Special

🎯 Critical Properties of Time Series

1. Temporal Order Matters

Frame 5 → Frame 6 → Frame 7 represents physical reality. Reversing or shuffling this order creates meaningless data.

2. Temporal Dependencies

Current values depend on past values:

x^t depends on x^(t-1), x^(t-2), ..., x^1

Example: Your foot position at frame 10 is influenced by where it was at frame 9.

3. Patterns Evolve Over Time

The transition from normal to abnormal happens gradually:

  • Frames 1-20: Normal walking
  • Frames 21-40: Subtle asymmetry appears
  • Frames 41-60: Clear ASD gait pattern

4. Dynamics Matter

Not just position, but velocity and acceleration:

  • Position: Where the joint is
  • Velocity: How fast it's moving (x^t - x^(t-1))
  • Acceleration: How velocity changes from one step to the next (see the sketch below)
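A small sketch of these dynamics computed from a position signal with NumPy (frame-to-frame differences; a real pipeline would divide by the sampling interval to obtain physical units):

import numpy as np

position = np.array([0.45, 0.46, 0.48, 0.52, 0.57])  # e.g. hip_x over 5 frames
velocity = np.diff(position)        # x^t - x^(t-1)      → [0.01 0.02 0.04 0.05]
acceleration = np.diff(velocity)    # change in velocity → [0.01 0.02 0.01]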

3.4 Challenges in Time-Series Classification

🎯 Key Challenges

Variable Length

Different sequences may have different lengths (some walks are longer than others). Solutions: padding, truncation, or subsequencing.

Temporal Misalignment

Similar patterns may occur at different time offsets. One person's gait cycle might start at frame 5, another's at frame 15.

High Dimensionality

With T=60 and d=12, you have 720 features. This creates the "curse of dimensionality" problem.

Class Imbalance

In medical/industrial applications, abnormal cases are rare. This is where T-SMOTE comes in!

3.5 Why Standard Methods Fail

Visualization: What Happens When You Flatten Time Series

Original time series (meaningful):

Frame 1 [12 features] → Frame 2 [12 features] → Frame 3 [12 features] → ... → Frame 60 [12 features]

After flattening for SMOTE (order lost):

[f₁,₁, f₁,₂, ..., f₁,₁₂, f₂,₁, f₂,₂, ..., f₂,₁₂, ..., f₆₀,₁, ..., f₆₀,₁₂]

Now it's just a 720-dimensional vector. The model has no way to know that f₁,₁ (hip_x at time 1) should be close to f₂,₁ (hip_x at time 2).

The Core Insight: Time series are not just collections of features—they're trajectories through feature space over time. Any augmentation technique must preserve this trajectory structure.

📏 4. The Leading Time Concept

The leading time is T-SMOTE's most innovative concept. It captures the idea that for classification tasks with temporal events (like failure, disease onset, or pattern occurrence), the most informative samples are those that capture the transition period—not just the final state.

4.1 Mathematical Definition

Leading Time (l)
The temporal offset from the end of a sequence. It determines how far back in time we extract a subsequence.
Subsequence with Leading Time l
X⁽ˡ⁾ᵢ = [xᵢ^(T-l-w+1), xᵢ^(T-l-w+2), ..., xᵢ^(T-l)]

where:
  • T: total sequence length
  • w: window size (subsequence length)
  • l: leading time (0, 1, 2, ..., L)

Intuitive Understanding

Think of leading time as "rewinding" the sequence:

  • l=0: The most recent w frames (ending at time T)
  • l=1: One step earlier (ending at time T-1)
  • l=2: Two steps earlier (ending at time T-2)
  • And so on...

🤔 Why Do We Need Leading Time?

The Problem with Just Using Final Frames

If you only look at the last window (l=0) for all positive samples:

  • All samples are deep in the positive region
  • Model doesn't learn the transition from negative → positive
  • Can't detect early-stage patterns
  • Poor performance on borderline cases

What Leading Time Achieves

By creating subsequences at different leading times:

  • l=0: Captures fully developed positive pattern (high confidence)
  • l=3: Captures mid-stage pattern (medium confidence)
  • l=7: Captures early-stage pattern (low confidence, near boundary)

This gives the model examples of how patterns evolve, not just their final state.

4.2 Visual Demonstration

Example: 10-Frame Sequence with Window Size w=5

Complete original sequence:

Frames: 1 2 3 4 5 6 7 8 9 10

Pattern strength grows over time: the later the frame, the more obvious the ASD pattern.


X⁽⁰⁾ (l=0): Last 5 frames [6,7,8,9,10]

Extracted: frames 6-10 (frames 1-5 omitted)

Model confidence: s⁽⁰⁾ = 0.95 (very confident this is ASD)

Meaning: Clear, fully-developed ASD gait pattern


X⁽¹⁾ (l=1): Frames [5,6,7,8,9]

Extracted: frames 5-9 (frames 1-4 and frame 10 omitted)

Model confidence: s⁽¹⁾ = 0.78 (fairly confident)

Meaning: Pattern is developing but not fully established


X⁽²⁾ (l=2): Frames [4,5,6,7,8]

Extracted: frames 4-8 (frames 1-3 and 9-10 omitted)

Model confidence: s⁽²⁾ = 0.54 (uncertain, borderline)

Meaning: Transition phase—could be ASD or normal


X⁽³⁾ (l=3): Frames [3,4,5,6,7]

Extracted: frames 3-7 (frames 1-2 and 8-10 omitted)

Model confidence: s⁽³⁾ = 0.32 (looks more normal)

Meaning: Early stage, before pattern fully emerges

The Magic of Leading Time: By generating subsequences at different leading times, we create a temporal spectrum from "clearly positive" to "borderline" to "almost negative." This teaches the model to recognize patterns at all stages of development.

4.3 Calculating Leading Time Indices

Step-by-Step Calculation

Given:

  • Total sequence length: T = 10
  • Window size: w = 5
  • Leading time: l = 2

Calculate start and end indices:

Start index: T - l - w + 1 = 10 - 2 - 5 + 1 = 4
End index: T - l = 10 - 2 = 8

Therefore: X⁽²⁾ = [x₄, x₅, x₆, x₇, x₈]

Verify:

  • Length = 8 - 4 + 1 = 5 ✓ (matches window size)
  • Ends at T-l = 8 ✓ (two steps before the end)
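The index arithmetic above can be wrapped in a small helper; this sketch uses 0-based NumPy slicing (the frame numbers in the text are 1-based), and the name extract_subsequence is assumed rather than taken from the paper:

import numpy as np

def extract_subsequence(X, l, w):
    # X: (T, d) time series; returns the w-step window ending l steps before the end
    T = X.shape[0]
    start = T - l - w      # 0-based equivalent of the 1-based index T - l - w + 1
    end = T - l            # slice end is exclusive, so the window ends at time T - l
    return X[start:end]

X = np.arange(1, 11).reshape(10, 1)                # toy series whose values are the frame numbers 1..10
print(extract_subsequence(X, l=2, w=5).ravel())    # [4 5 6 7 8]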

4.4 Choosing Maximum Leading Time (L)

📐 Practical Guidelines

Maximum Possible Value

L_max = T - w

This ensures you don't try to extract a subsequence that starts before the beginning of the series.

Recommended Range

Typically, L is set to capture the meaningful transition period:

  • Short sequences (T < 50): L = 3 to 5
  • Medium sequences (50 ≤ T ≤ 200): L = 5 to 10
  • Long sequences (T > 200): L = 10 to 20

Domain Knowledge Matters

For gait analysis: If a gait cycle is ~30 frames and the transition takes ~15 frames, set L ≈ 15

For equipment failure: If failure indicators appear 100 time steps before failure, set L ≈ 100

4.5 Why Not Just Use the Entire Sequence?

❌ Using Full Sequence

  • Very long input to model
  • Computational cost increases
  • Early frames may be irrelevant noise
  • Harder to learn what's important
  • Doesn't focus on transition period

✅ Using Subsequences with Leading Time

  • Fixed-length windows (easier to batch)
  • Computationally efficient
  • Focuses on relevant time period
  • Multiple training examples from one sequence
  • Captures temporal evolution

🎯 5. Model Confidence Scores

Model confidence scores are the bridge between the temporal subsequences and the synthetic sample generation. They tell us how "positive" each subsequence looks, which guides how we mix them.

5.1 What is Model Confidence?

Model Confidence Score (s⁽ˡ⁾ᵢ)
The predicted probability that subsequence X⁽ˡ⁾ᵢ belongs to the positive (minority) class, as output by a trained classifier.
Mathematical Form
s⁽ˡ⁾ᵢ = f(X⁽ˡ⁾ᵢ) ∈ [0, 1]

where:
  • f: trained classifier (LSTM, CNN, etc.)
  • s⁽ˡ⁾ᵢ: confidence score
  • Range: 0 (definitely negative) to 1 (definitely positive)

🤔 Why Do We Need Confidence Scores?

Problem: Not All Subsequences Are Equally Useful

Consider two subsequences from an ASD gait sequence:

  • X⁽⁰⁾: Last 5 frames — clearly shows ASD pattern
  • X⁽⁸⁾: Very early frames — looks completely normal

If we mix these randomly with equal weight, we might generate samples that are too normal to be useful, or too mixed to be realistic.

Solution: Use Model's Own Assessment

The trained model can tell us:

  • Which subsequences are "deep" in the positive region (high confidence)
  • Which are borderline (medium confidence)
  • Which look negative (low confidence)

We use this information to intelligently guide the mixing process.

5.2 Computing Confidence Scores

Step-by-Step Process

Train a Base Classifier

First, train an initial classifier on your imbalanced dataset (before applying T-SMOTE). This can be:

  • LSTM (for sequential dependencies)
  • 1D CNN (for local patterns)
  • Transformer (for long-range dependencies)
  • Any model that outputs probabilities

Note: This doesn't need to be perfect—it just needs to provide reasonable confidence estimates.

Generate Subsequences

For each positive sample in your dataset, create subsequences at different leading times:

X⁽⁰⁾ᵢ, X⁽¹⁾ᵢ, X⁽²⁾ᵢ, ..., X⁽ᴸ⁾ᵢ

Run Through Classifier

Pass each subsequence through the trained model to get probability outputs:

Python Example
# Assuming a trained classifier exposing a predict_proba-style interface
# (for Keras/PyTorch, read the positive-class probability from the model's output instead)
scores = {}
for l in range(L + 1):
    subseq = extract_subsequence(X_i, l, window_size)   # window ending l steps before the end
    confidence = model.predict_proba(subseq)[0, 1]      # probability of the positive class
    scores[l] = confidence

Store and Use

Store these confidence scores—you'll use them to:

  • Determine mixing weights (via Beta distribution)
  • Calculate synthetic sample confidences
  • Filter unreliable samples (via weighted sampling)

5.3 Interpreting Confidence Scores

Score Range Interpretation Position in Feature Space Usefulness for Training
0.9 - 1.0 Very high confidence positive Deep in positive region Good for establishing class center
0.7 - 0.9 Confident positive Solidly in positive region Most useful for training
0.5 - 0.7 Likely positive Approaching decision boundary Critical for learning boundaries
0.3 - 0.5 Uncertain/borderline Near or on decision boundary Handle carefully—may be mislabeled
0.0 - 0.3 Looks negative In negative region Likely mislabeled or very early stage

5.4 The Confidence Progression Pattern

📊 Typical Pattern for an ASD Gait Sequence

Sequence Length: T = 60 frames, Window: w = 20 frames

Leading Time (l) Frames Used Confidence (s⁽ˡ⁾) Description
0 [41-60] 0.94 Clear ASD pattern established
5 [36-55] 0.88 Pattern visible but less pronounced
10 [31-50] 0.76 Transitional phase beginning
15 [26-45] 0.58 Subtle abnormalities emerging
20 [21-40] 0.42 Mostly normal with hints
25 [16-35] 0.31 Appears normal

Key Observation: Confidence scores decrease as we go back in time, showing the gradual evolution from normal to ASD gait.

Why This Matters: The confidence scores encode the model's understanding of how patterns develop over time. By using these scores to guide synthesis, we ensure synthetic samples respect the natural progression of the pattern.

5.5 Edge Cases and Considerations

⚠️ Common Pitfalls

1. Poor Base Classifier

Problem: If your initial classifier is terrible (random guessing), confidence scores will be meaningless.

Solution: Ensure your base classifier achieves at least moderate performance (e.g., AUC > 0.6) before using T-SMOTE.

2. All High Confidences

Problem: If all subsequences have confidence > 0.9, you're not capturing the transition.

Solution: Increase maximum leading time L to go further back in time.

3. All Low Confidences

Problem: If all scores are < 0.5, the sample might be mislabeled.

Solution: Review labels or exclude this sample from augmentation.

4. Non-Monotonic Progression

Problem: Sometimes scores don't decrease smoothly (e.g., s⁽³⁾ > s⁽¹⁾).

Solution: This is normal due to noise. T-SMOTE is robust to small fluctuations.

💡 Pro Tip: Warm-Start Strategy

If your initial dataset is very imbalanced (e.g., 1:100), your base classifier might struggle. Try this approach:

  1. Apply simple oversampling (duplication) to get to 1:10 ratio
  2. Train a base classifier on this
  3. Use this classifier to compute T-SMOTE confidence scores
  4. Apply T-SMOTE to get to 1:1 ratio
  5. Train final classifier on T-SMOTE augmented data
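A minimal sketch of step 1 (duplicating minority samples to reach roughly a 1:10 ratio); the function name, array layout, and target ratio are illustrative:

import numpy as np

def duplicate_minority(X_min, X_maj, target_ratio=0.1, rng=np.random.default_rng(0)):
    # Repeat randomly chosen minority samples until minority ≈ target_ratio × majority
    n_needed = int(target_ratio * len(X_maj)) - len(X_min)
    if n_needed <= 0:
        return X_min
    extra = X_min[rng.integers(len(X_min), size=n_needed)]
    return np.concatenate([X_min, extra], axis=0)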

📊 6. Beta Distribution & Mixing Weights

The Beta distribution is the mathematical heart of T-SMOTE. It determines how to mix two temporal neighbors based on their confidence scores, ensuring synthetic samples are both diverse and realistic.

6.1 What is the Beta Distribution?

Beta Distribution
A continuous probability distribution defined on the interval [0, 1], parameterized by two shape parameters α and β (often denoted as a and b).
Mathematical Form
X ~ Beta(a, b)

Probability Density Function:
f(x; a, b) = x^(a-1) · (1-x)^(b-1) / B(a, b)

where B(a, b) is the Beta function (normalization constant)

Mean: E[X] = a / (a + b)
Variance: Var[X] = (a·b) / ((a+b)²(a+b+1))

6.2 Why Beta Distribution?

🎯 Perfect Properties for Our Task

1. Bounded to [0,1]

For interpolation X_new = α·X⁽ˡ⁾ + (1-α)·X⁽ˡ⁺¹⁾, we need α ∈ [0,1]. Beta naturally lives in this range.

2. Flexible Shapes

By varying parameters a and b, Beta can be:

  • Uniform: Beta(1,1) → equal probability for all α
  • Skewed toward 0: Beta(0.3, 0.8) → favors small α
  • Skewed toward 1: Beta(0.8, 0.3) → favors large α
  • Bell-shaped: Beta(5, 5) → concentrated around 0.5

3. Natural Interpretation

When a = s⁽ˡ⁾ and b = s⁽ˡ⁺¹⁾:

  • If s⁽ˡ⁾ > s⁽ˡ⁺¹⁾: α tends toward 1 → more weight on recent (confident) subsequence
  • If s⁽ˡ⁾ < s⁽ˡ⁺¹⁾: α tends toward 0 → more weight on earlier subsequence
  • If s⁽ˡ⁾ ≈ s⁽ˡ⁺¹⁾: α around 0.5 → balanced mixing

4. Incorporates Model Knowledge

By using confidence scores as parameters, we let the model's own assessment guide the augmentation process.
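A short sketch evaluating these shapes with SciPy (scipy.stats.beta); the α grid is only for illustration:

import numpy as np
from scipy.stats import beta

x = np.linspace(0.05, 0.95, 5)                 # a few α values to probe
for a, b in [(1, 1), (0.3, 0.8), (0.8, 0.3), (5, 5)]:
    # mean = a / (a + b); the pdf values show where the probability mass sits
    print(f"Beta({a},{b}): mean={beta.mean(a, b):.2f}, pdf={np.round(beta.pdf(x, a, b), 2)}")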

6.3 T-SMOTE's Use of Beta

The Formula

α ~ Beta(s⁽ˡ⁾ᵢ, s⁽ˡ⁺¹⁾ᵢ)

where:
  • s⁽ˡ⁾ᵢ: confidence of the subsequence at leading time l
  • s⁽ˡ⁺¹⁾ᵢ: confidence of the subsequence at leading time l+1
  • α: sampled mixing weight

Intuitive Interpretation

Think of a and b as "votes" for which subsequence to favor:

  • a = s⁽ˡ⁾ = 0.9: 9 votes for X⁽ˡ⁾ (recent, confident)
  • b = s⁽ˡ⁺¹⁾ = 0.4: 4 votes for X⁽ˡ⁺¹⁾ (earlier, less confident)
  • Expected α: 0.9/(0.9+0.4) ≈ 0.69 → leans toward recent one
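A quick NumPy check of this "votes" intuition, using only the assumption that the mixing weight is drawn from Beta(s⁽ˡ⁾, s⁽ˡ⁺¹⁾):

import numpy as np

rng = np.random.default_rng(0)
alphas = rng.beta(0.9, 0.4, size=100_000)  # many draws of the mixing weight
print(round(alphas.mean(), 2))             # ≈ 0.69 = 0.9 / (0.9 + 0.4)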

6.4 Visual Examples

Case 1: High vs Low Confidence

Parameters: Beta(0.90, 0.30)

  • Mean: 0.90/(0.90+0.30) = 0.75
  • Shape: mass concentrated toward large α
  • Interpretation: Sampled α values cluster toward 1 (mean 0.75), heavily favoring X⁽ˡ⁾

Distribution Shape:

Probability
    ▲
 1.5│                    ╱█
    │                   ╱ █
 1.0│                  ╱  █
    │                 ╱   █
 0.5│         ▁▁▁▁▁▁▁╱    █
    │  ▁▁▁▁▁▁▁             █
 0.0└────────────────────────────▶ α
    0   0.2  0.4  0.6  0.8  1.0
        

Result: Synthetic samples will be very similar to X⁽⁰⁾ (recent, confident)


Case 2: Similar Confidences

Parameters: Beta(0.65, 0.60)

  • Mean: 0.65/(0.65+0.60) = 0.52
  • Shape: Roughly symmetric
  • Interpretation: Balanced mixing with slight favor to X⁽ˡ⁾

Distribution Shape:

Probability
    ▲
 1.5│         ╱█╲
    │        ╱ █ ╲
 1.0│       ╱  █  ╲
    │      ╱   █   ╲
 0.5│  ▁▁▁╱    █    ╲▁▁▁
    │ ▁▁▁      █      ▁▁▁
 0.0└────────────────────────────▶ α
    0   0.2  0.4  0.6  0.8  1.0
        

Result: Diverse synthetic samples spanning both subsequences

6.5 Numerical Example

🔢 Step-by-Step Calculation

Setup:

  • Subsequence X⁽¹⁾: confidence s⁽¹⁾ = 0.84
  • Subsequence X⁽²⁾: confidence s⁽²⁾ = 0.58

Set Beta Parameters

a = s⁽¹⁾ = 0.84
b = s⁽²⁾ = 0.58

Calculate Expected Value

E[α] = a/(a+b) = 0.84/(0.84+0.58) = 0.84/1.42 ≈ 0.592

Interpretation: On average, synthetic samples will be 59.2% from X⁽¹⁾ and 40.8% from X⁽²⁾

Sample α (in practice)

Python Implementation
import numpy as np

# Sample a mixing weight from Beta(s⁽¹⁾, s⁽²⁾)
alpha = np.random.beta(0.84, 0.58)
# Example output: alpha = 0.627
print(f"Sampled α: {alpha:.3f}")
# This specific sample: 62.7% from X⁽¹⁾, 37.3% from X⁽²⁾

Create Synthetic Sample

X_new = 0.627 × X⁽¹⁾ + 0.373 × X⁽²⁾

This preserves temporal structure while creating a slightly earlier version of the pattern.

6.6 Why Not Simpler Alternatives?

❌ Uniform Random (α ~ U[0,1])

  • Ignores confidence information
  • Treats all mixing equally likely
  • Could create unrealistic samples
  • No model guidance

Example: Might mix 90% confident with 30% confident subsequence using α=0.1, creating mostly negative-looking sample

✅ Beta Distribution

  • Incorporates confidence
  • Adaptively weights mixing
  • Creates realistic samples
  • Model-guided augmentation

Same scenario: Beta(0.9, 0.3) naturally produces α around 0.7-0.8, keeping samples positive

❌ Fixed α (e.g., α=0.5)

  • No diversity
  • All synthetics are identical
  • Overfitting risk
  • Doesn't explore space

✅ Sampled α from Beta

  • Natural diversity
  • Different synthetics each time
  • Better generalization
  • Explores around mean
The Brilliance of Beta: It automatically adjusts the mixing strategy based on how confident the model is about each subsequence. High confidence differences → skewed mixing (favor confident one). Similar confidences → balanced mixing (explore between them).

⚗️ 7. Synthesizing New Samples

Now we bring everything together: temporal subsequences, confidence scores, and Beta-sampled mixing weights combine to create synthetic time-series samples that are both temporally coherent and strategically positioned in feature space.

7.1 The Core Synthesis Formula

X_new = α · X⁽ˡ⁾ᵢ + (1-α) · X⁽ˡ⁺¹⁾ᵢ

where:
  • X⁽ˡ⁾ᵢ, X⁽ˡ⁺¹⁾ᵢ: two consecutive temporal subsequences
  • α ~ Beta(s⁽ˡ⁾ᵢ, s⁽ˡ⁺¹⁾ᵢ): mixing weight
  • X_new: synthetic subsequence (w × d matrix)

This operation is performed element-wise on all time steps and features.

📐 Element-Wise Operation

The interpolation happens for every single value in the matrices:

For each time step t ∈ [1, w] and feature f ∈ [1, d]:

X_new[t, f] = α · X⁽ˡ⁾[t, f] + (1-α) · X⁽ˡ⁺¹⁾[t, f]

This ensures temporal coherence—we're not mixing different time steps!
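In NumPy this element-wise mix is a single vectorized expression over two (w × d) arrays; the shapes below are illustrative:

import numpy as np

rng = np.random.default_rng(0)
X_l, X_l1 = rng.random((2, 5, 12))          # two subsequences: window w=5, d=12 features
alpha = rng.beta(0.84, 0.58)                # mixing weight from the two confidence scores
X_new = alpha * X_l + (1 - alpha) * X_l1    # the same α applied to every time step and feature
print(X_new.shape)                          # (5, 12)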

7.2 Complete Example with Real Numbers

🔢 Full Worked Example

Scenario: 3-feature gait data, window size w=4

Step 1: Two Temporal Subsequences

X⁽⁰⁾ (recent, l=0, confidence = 0.84):

Time  hip_x  knee_x  ankle_x
1     1.6    0.9     2.5
2     1.8    1.0     2.7
3     2.0    1.1     2.9
4     2.2    1.3     3.0

X⁽¹⁾ (earlier, l=1, confidence = 0.58):

Time  hip_x  knee_x  ankle_x
1     1.3    0.8     2.3
2     1.6    0.9     2.5
3     1.8    1.0     2.7
4     2.0    1.1     2.9

Step 2: Sample Mixing Weight

α ~ Beta(0.84, 0.58)

Let's say we sample: α = 0.59
Therefore: (1-α) = 0.41

Step 3: Compute Synthetic Sample (Element-by-Element)

Time step 1:

hip_x: 0.59×1.6 + 0.41×1.3 = 0.944 + 0.533 = 1.477

knee_x: 0.59×0.9 + 0.41×0.8 = 0.531 + 0.328 = 0.859

ankle_x: 0.59×2.5 + 0.41×2.3 = 1.475 + 0.943 = 2.418

Time step 2:

hip_x: 0.59×1.8 + 0.41×1.6 = 1.062 + 0.656 = 1.718

knee_x: 0.59×1.0 + 0.41×0.9 = 0.590 + 0.369 = 0.959

ankle_x: 0.59×2.7 + 0.41×2.5 = 1.593 + 1.025 = 2.618

Time step 3:

hip_x: 0.59×2.0 + 0.41×1.8 = 1.180 + 0.738 = 1.918

knee_x: 0.59×1.1 + 0.41×1.0 = 0.649 + 0.410 = 1.059

ankle_x: 0.59×2.9 + 0.41×2.7 = 1.711 + 1.107 = 2.818

Time step 4:

hip_x: 0.59×2.2 + 0.41×2.0 = 1.298 + 0.820 = 2.118

knee_x: 0.59×1.3 + 0.41×1.1 = 0.767 + 0.451 = 1.218

ankle_x: 0.59×3.0 + 0.41×2.9 = 1.770 + 1.189 = 2.959

Step 4: Final Synthetic Sample (X_new)

Time  hip_x  knee_x  ankle_x
1     1.477  0.859   2.418
2     1.718  0.959   2.618
3     1.918  1.059   2.818
4     2.118  1.218   2.959

✅ Verification:

  • All values lie between X⁽⁰⁾ and X⁽¹⁾ ✓
  • Temporal progression is smooth ✓
  • Closer to X⁽⁰⁾ (since α=0.59) ✓
  • Physically plausible joint positions ✓
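The worked example can be reproduced in a few lines of NumPy (values copied from the tables; α fixed at 0.59 rather than sampled):

import numpy as np

X0 = np.array([[1.6, 0.9, 2.5],
               [1.8, 1.0, 2.7],
               [2.0, 1.1, 2.9],
               [2.2, 1.3, 3.0]])   # X⁽⁰⁾, confidence 0.84
X1 = np.array([[1.3, 0.8, 2.3],
               [1.6, 0.9, 2.5],
               [1.8, 1.0, 2.7],
               [2.0, 1.1, 2.9]])   # X⁽¹⁾, confidence 0.58
alpha = 0.59
X_new = alpha * X0 + (1 - alpha) * X1
print(np.round(X_new, 3))          # first row: [1.477 0.859 2.418], matching the table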

7.3 Synthetic Label Confidence

Along with the synthetic sequence, we also compute its expected confidence:

s_new = α · s⁽ˡ⁾ᵢ + (1-α) · s⁽ˡ⁺¹⁾ᵢ

Continuing Our Example:

s_new = 0.59 × 0.84 + 0.41 × 0.58 = 0.4956 + 0.2378 = 0.7334

Interpretation: The synthetic sample is expected to have ~73% confidence of being ASD—still positive, but closer to the decision boundary than X⁽⁰⁾.
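A two-line check of this label-confidence mix (plain Python, with α fixed at 0.59 as above):

alpha = 0.59
s_new = alpha * 0.84 + (1 - alpha) * 0.58   # mix the two confidence scores with the same α
print(round(s_new, 4))                      # 0.7334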

7.4 Why This Works: The Geometry

Geometric Interpretation in Feature Space

Imagine plotting confidence scores along a temporal axis:

Confidence
    1.0│                         ● X⁽⁰⁾ (0.84)
       │                        ╱
    0.8│                    ★  ← X_new (0.73)
       │                   ╱
    0.6│              ● X⁽¹⁾ (0.58)
       │             ╱
    0.4│        ● X⁽²⁾
       │       ╱
    0.2│  ● X⁽³⁾
       │
    0.0└─────────────────────────────────▶ Time
       earlier                       recent
    

Key Points:

  • Synthetic sample (★) falls between its two temporal neighbors, X⁽⁰⁾ and X⁽¹⁾, in both time and confidence