📊 T-SMOTE Technical Manual

Temporal-SMOTE for Imbalanced Time-Series Classification

A Complete Mathematical, Practical, and Implementation Guide

Based on: "T-SMOTE: Temporal-oriented Synthetic Minority Oversampling Technique for Imbalanced Time Series Classification"
Authors: Microsoft Research | Published: IJCAI 2022

📑 Table of Contents

🎯 1. Introduction & Motivation

T-SMOTE (Temporal-SMOTE) is a specialized oversampling technique designed to address class imbalance in time-series classification problems while preserving temporal dependencies and dynamics.

1.1 The Class Imbalance Problem

In many real-world applications, datasets exhibit severe class imbalance where the minority class (positive cases) is significantly underrepresented compared to the majority class (negative cases). This imbalance causes several critical issues:

  • Model Bias: Classifiers tend to predict the majority class to maximize overall accuracy
  • Poor Minority Detection: The minority class, often the one we care about most (e.g., disease, fraud, failure), gets ignored
  • Evaluation Pitfalls: High accuracy can be misleading when 95% of data belongs to one class
  • Business Impact: Missing rare events can have severe consequences (missed diagnoses, undetected fraud, equipment failures)

🤔 Why Do We Need T-SMOTE?

The Real-World Scenarios:

Medical Diagnosis

In gait analysis for autism spectrum disorder (ASD) detection, you might have 1000 normal gait sequences but only 50 ASD cases. Without proper handling, your LSTM model will simply learn to classify everything as "normal" and achieve 95% accuracy—completely missing the point.

Equipment Failure Prediction

Industrial sensors record thousands of hours of normal operation but only a few failure events. Predicting these rare failures is crucial for maintenance scheduling and preventing costly downtime.

Financial Fraud Detection

Among millions of legitimate transactions, fraudulent ones are rare but extremely costly. The temporal pattern of how fraud develops is key to detection.

Why Traditional Methods Fail

Standard oversampling techniques like SMOTE treat time series as static feature vectors, destroying the temporal order that contains critical information about how patterns evolve and transition between classes.

1.2 Core Innovation of T-SMOTE

T-SMOTE introduces three fundamental innovations:

  1. Temporal Awareness: Treats time as a first-class dimension, not just another feature
  2. Progressive Subsequencing: Generates samples at different temporal positions using "leading times"
  3. Confidence-Guided Synthesis: Uses model predictions to guide interpolation, ensuring synthetic samples are realistic and useful
Core Principle: Instead of randomly mixing samples in feature space, T-SMOTE slides through time, creating synthetic sequences that represent earlier stages of pattern development. This allows models to learn transitional patterns—the gradual shift from normal to abnormal behavior.

1.3 Key Terminology

Time Series
A sequence of observations ordered in time, where each observation is a vector of features measured at a specific time point.
Class Imbalance
A situation where one class (minority) has significantly fewer samples than the other class (majority), typically with a ratio more extreme than 1:10.
Oversampling
A technique to balance class distribution by generating synthetic samples for the minority class.
Temporal Dependency
The relationship between observations at different time points, where the current state depends on previous states.
Decision Boundary
The hyperplane or surface that separates different classes in feature space. Samples near this boundary are hardest to classify.

1.4 When to Use T-SMOTE

T-SMOTE is particularly effective when:

✅ Ideal Use Cases

  • Your data is sequential (time series, sensor data, behavioral sequences)
  • You have severe class imbalance (minority class < 20% of total)
  • The temporal evolution of patterns is important (not just final state)
  • You have a pretrained classifier that can provide confidence scores
  • Transitional patterns matter (how normal becomes abnormal)

⚠️ Not Recommended When

  • Your data is static/tabular without temporal order → use standard SMOTE
  • Classes are already balanced → no oversampling needed
  • You have very short sequences (e.g., <10 time steps) → limited room for subsequencing
  • Temporal order doesn't matter to the classification task

⚠️ 2. The Problem with Standard SMOTE

2.1 How Standard SMOTE Works

SMOTE (Synthetic Minority Over-sampling Technique), introduced by Chawla et al. in 2002, is a foundational technique for handling imbalanced data in traditional machine learning.

Mathematical Formulation

For static feature vectors:

X = [x₁, x₂, ..., xₙ] ∈ ℝⁿ, y ∈ {0,1}

SMOTE generates synthetic samples by linear interpolation:

X_new = X_A + λ(X_B - X_A)

where:
  • X_A: a minority sample
  • X_B: one of its k-nearest neighbors (also minority)
  • λ: random value ∈ [0, 1]

Step-by-Step Process

Standard SMOTE Algorithm

  1. Select a minority class sample X_A
  2. Find its k nearest neighbors (typically k=5) in feature space
  3. Randomly choose one neighbor X_B
  4. Generate random λ ∈ [0,1]
  5. Create synthetic sample: X_new = X_A + λ(X_B - X_A)
  6. Repeat until the desired class balance is achieved (a minimal sketch of these steps follows)
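A minimal NumPy-only sketch of these six steps, assuming a 2-D array X_min of minority-class feature vectors; the helper name smote_oversample and the brute-force neighbor search are illustrative, not a reference implementation:

import numpy as np

def smote_oversample(X_min, n_new, k=5, rng=np.random.default_rng(0)):
    # X_min: (n_minority, n_features) array of minority-class samples
    synthetic = []
    for _ in range(n_new):
        a = rng.integers(len(X_min))                       # step 1: pick a minority sample
        dists = np.linalg.norm(X_min - X_min[a], axis=1)   # step 2: distances to all minority samples
        neighbors = np.argsort(dists)[1:k + 1]             #         its k nearest neighbors (skip itself)
        b = rng.choice(neighbors)                          # step 3: choose one neighbor
        lam = rng.random()                                 # step 4: λ ∈ [0, 1]
        synthetic.append(X_min[a] + lam * (X_min[b] - X_min[a]))  # step 5: interpolate
    return np.asarray(synthetic)                           # step 6: repeated n_new times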

📊 Concrete Example: Credit Scoring

Sample A (defaulter):

Age Income Credit Score
35 45000 580

Sample B (defaulter, nearest neighbor):

Age Income Credit Score
40 50000 600

With λ = 0.6:

New Sample = X_A + 0.6(X_B - X_A) = 0.4 × A + 0.6 × B
Age = 0.4(35) + 0.6(40) = 38
Income = 0.4(45000) + 0.6(50000) = 48000
Credit Score = 0.4(580) + 0.6(600) = 592

This works perfectly because these features are independent and static.
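A quick NumPy check of this arithmetic (feature order [age, income, credit score]; values copied from the tables above):

import numpy as np

A = np.array([35, 45000, 580])   # minority sample A
B = np.array([40, 50000, 600])   # its nearest minority neighbor B
lam = 0.6
X_new = A + lam * (B - A)        # SMOTE interpolation
print(X_new)                     # [   38. 48000.   592.]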

2.2 Why SMOTE Fails for Time Series

🚫 Critical Failure Modes

Problem 1: Temporal Order Destruction

When you flatten a time series into a feature vector, you lose the sequential structure:

Original:  [frame₁, frame₂, frame₃, frame₄, frame₅]
Flattened: [f₁₁, f₁₂, f₁₃, f₂₁, f₂₂, f₂₃, ..., f₅₃]

SMOTE then treats f₁₁ (feature 1 at time 1) and f₃₂ (feature 2 at time 3) as if they're interchangeable—completely ignoring that they occur at different times.

Problem 2: Unrealistic Temporal Mixing

SMOTE might interpolate between samples at completely different temporal phases:

  • Mixing the beginning of one gait cycle with the end of another
  • Combining early-stage failure indicators with late-stage indicators
  • Blending different phases of a heartbeat cycle

Result: Physically impossible synthetic sequences

Problem 3: Loss of Dynamics

Time series contain information in their dynamics—velocity, acceleration, trends. SMOTE interpolation destroys these:

  • Smooth trends become jagged
  • Periodic patterns get distorted
  • Temporal correlations are broken

🎭 Illustrative Example: Gait Analysis

Sequence A (ASD gait): Complete gait cycle from heel strike to toe-off

Phases: Heel Strike → Loading → Mid-stance → Push-off → Swing

Sequence B (ASD gait): Similar but different timing

Phases: Loading → Mid-stance → Push-off → Swing → Heel Strike

What SMOTE produces: Random mixing of phases

Synthetic (frame-by-frame mix of misaligned phases):
  • 0.6×Heel Strike + 0.4×Loading
  • 0.6×Loading + 0.4×Mid-stance
  • 0.6×Mid-stance + 0.4×Push-off
  • 0.6×Push-off + 0.4×Swing
  • 0.6×Swing + 0.4×Heel Strike

This creates biomechanically impossible movement patterns!

2.3 Comparison: What Works vs. What Doesn't

✅ SMOTE Works Great For:

  • Tabular data: Customer demographics, financial ratios
  • Image features: Pixel values, color histograms
  • Static measurements: Lab test results, survey responses
  • Independent features: Where feature order doesn't matter

Why? These features don't have temporal dependencies

❌ SMOTE Fails For:

  • Time series: Sensor readings, physiological signals
  • Sequential data: Video frames, speech signals
  • Behavioral sequences: User actions, transaction patterns
  • Temporal patterns: Where order and dynamics are crucial

Why? Temporal dependencies get destroyed

3. Understanding Time-Series Structure

3.1 Mathematical Representation

A time series is fundamentally different from static data because it contains ordered observations over time.

Formal Definition

Xᵢ = [x¹ᵢ, x²ᵢ, x³ᵢ, ..., xᵀᵢ]

where:
  • i: sample index
  • T: total number of time steps (sequence length)
  • xᵗᵢ ∈ ℝᵈ: feature vector at time t
  • d: number of features (dimensions)

📝 Understanding the Notation

Subscript i: Identifies which time series (e.g., patient #5, sensor #12)

Superscript t: Identifies the time step within that series

Example: x³₅ means "features at time step 3 of time series #5"

3.2 Anatomy of Time-Series Data

🚶 Concrete Example: Gait Analysis Data

Setup: Motion capture of a person walking for 2 seconds at 30 FPS

  • T = 60 time steps (frames)
  • d = 12 features per frame (4 joints × 3 coordinates each)
  • Features: [hip_x, hip_y, hip_z, knee_x, knee_y, knee_z, ankle_x, ankle_y, ankle_z, foot_x, foot_y, foot_z]

Data structure:

Time hip_x hip_y hip_z ... foot_z
t=1 0.45 0.92 0.15 ... 0.02
t=2 0.46 0.93 0.16 ... 0.03
... ... ... ... ... ...
t=60 0.52 0.89 0.18 ... 0.01

Shape: (60, 12) — a matrix where each row is one time step
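As a minimal sketch, this is how such a recording might be held in a NumPy array (random values stand in for real motion-capture data):

import numpy as np

rng = np.random.default_rng(0)
gait = rng.random((60, 12))     # T=60 frames, d=12 features per frame
frame_10 = gait[9]              # feature vector at time step 10, shape (12,)
hip_x = gait[:, 0]              # one feature tracked across all 60 frames, shape (60,)
print(gait.shape)               # (60, 12)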

3.3 What Makes Time Series Special

🎯 Critical Properties of Time Series

1. Temporal Order Matters

Frame 5 → Frame 6 → Frame 7 represents physical reality. Reversing or shuffling this order creates meaningless data.

2. Temporal Dependencies

Current values depend on past values:

x^t depends on x^(t-1), x^(t-2), ..., x^1

Example: Your foot position at frame 10 is influenced by where it was at frame 9.

3. Patterns Evolve Over Time

The transition from normal to abnormal happens gradually:

  • Frames 1-20: Normal walking
  • Frames 21-40: Subtle asymmetry appears
  • Frames 41-60: Clear ASD gait pattern

4. Dynamics Matter

Not just position, but velocity and acceleration:

  • Position: Where the joint is
  • Velocity: How fast it's moving (x^t - x^(t-1))
  • Acceleration: How velocity changes from one step to the next (see the sketch below)
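A small sketch of these dynamics computed from a position signal with NumPy (frame-to-frame differences; a real pipeline would divide by the sampling interval to obtain physical units):

import numpy as np

position = np.array([0.45, 0.46, 0.48, 0.52, 0.57])  # e.g. hip_x over 5 frames
velocity = np.diff(position)        # x^t - x^(t-1)      → [0.01 0.02 0.04 0.05]
acceleration = np.diff(velocity)    # change in velocity → [0.01 0.02 0.01]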

3.4 Challenges in Time-Series Classification

🎯 Key Challenges

Variable Length

Different sequences may have different lengths (some walks are longer than others). Solutions: padding, truncation, or subsequencing.

Temporal Misalignment

Similar patterns may occur at different time offsets. One person's gait cycle might start at frame 5, another's at frame 15.

High Dimensionality

With T=60 and d=12, you have 720 features. This creates the "curse of dimensionality" problem.

Class Imbalance

In medical/industrial applications, abnormal cases are rare. This is where T-SMOTE comes in!

3.5 Why Standard Methods Fail

Visualization: What Happens When You Flatten Time Series

Original time series (meaningful):

Frame 1 [12 features] → Frame 2 [12 features] → Frame 3 [12 features] → ... → Frame 60 [12 features]

After flattening for SMOTE (order lost):

[f₁,₁, f₁,₂, ..., f₁,₁₂, f₂,₁, f₂,₂, ..., f₂,₁₂, ..., f₆₀,₁, ..., f₆₀,₁₂]

Now it's just a 720-dimensional vector. The model has no way to know that f₁,₁ (hip_x at time 1) should be close to f₂,₁ (hip_x at time 2).

The Core Insight: Time series are not just collections of features—they're trajectories through feature space over time. Any augmentation technique must preserve this trajectory structure.

📏 4. The Leading Time Concept

The leading time is T-SMOTE's most innovative concept. It captures the idea that for classification tasks with temporal events (like failure, disease onset, or pattern occurrence), the most informative samples are those that capture the transition period—not just the final state.

4.1 Mathematical Definition

Leading Time (l)
The temporal offset from the end of a sequence. It determines how far back in time we extract a subsequence.
Subsequence with Leading Time l
X⁽ˡ⁾ᵢ = [xᵢ^(T-l-w+1), xᵢ^(T-l-w+2), ..., xᵢ^(T-l)]

where:
  • T: total sequence length
  • w: window size (subsequence length)
  • l: leading time (0, 1, 2, ..., L)

Intuitive Understanding

Think of leading time as "rewinding" the sequence:

  • l=0: The most recent w frames (ending at time T)
  • l=1: One step earlier (ending at time T-1)
  • l=2: Two steps earlier (ending at time T-2)
  • And so on...

🤔 Why Do We Need Leading Time?

The Problem with Just Using Final Frames

If you only look at the last window (l=0) for all positive samples:

  • All samples are deep in the positive region
  • Model doesn't learn the transition from negative → positive
  • Can't detect early-stage patterns
  • Poor performance on borderline cases

What Leading Time Achieves

By creating subsequences at different leading times:

  • l=0: Captures fully developed positive pattern (high confidence)
  • l=3: Captures mid-stage pattern (medium confidence)
  • l=7: Captures early-stage pattern (low confidence, near boundary)

This gives the model examples of how patterns evolve, not just their final state.

4.2 Visual Demonstration

Example: 10-Frame Sequence with Window Size w=5

Complete original sequence:

Frames: 1 2 3 4 5 6 7 8 9 10

Pattern strength grows over time: the later the frame, the more obvious the ASD pattern.


X⁽⁰⁾ (l=0): Last 5 frames [6,7,8,9,10]

Extracted: frames 6-10 (frames 1-5 omitted)

Model confidence: s⁽⁰⁾ = 0.95 (very confident this is ASD)

Meaning: Clear, fully-developed ASD gait pattern


X⁽¹⁾ (l=1): Frames [5,6,7,8,9]

Extracted: frames 5-9 (frames 1-4 and frame 10 omitted)

Model confidence: s⁽¹⁾ = 0.78 (fairly confident)

Meaning: Pattern is developing but not fully established


X⁽²⁾ (l=2): Frames [4,5,6,7,8]

Extracted: frames 4-8 (frames 1-3 and 9-10 omitted)

Model confidence: s⁽²⁾ = 0.54 (uncertain, borderline)

Meaning: Transition phase—could be ASD or normal


X⁽³⁾ (l=3): Frames [3,4,5,6,7]

Extracted: frames 3-7 (frames 1-2 and 8-10 omitted)

Model confidence: s⁽³⁾ = 0.32 (looks more normal)

Meaning: Early stage, before pattern fully emerges

The Magic of Leading Time: By generating subsequences at different leading times, we create a temporal spectrum from "clearly positive" to "borderline" to "almost negative." This teaches the model to recognize patterns at all stages of development.

4.3 Calculating Leading Time Indices

Step-by-Step Calculation

Given:

  • Total sequence length: T = 10
  • Window size: w = 5
  • Leading time: l = 2

Calculate start and end indices:

Start index: T - l - w + 1 = 10 - 2 - 5 + 1 = 4
End index: T - l = 10 - 2 = 8

Therefore: X⁽²⁾ = [x₄, x₅, x₆, x₇, x₈]

Verify:

  • Length = 8 - 4 + 1 = 5 ✓ (matches window size)
  • Ends at T-l = 8 ✓ (two steps before the end)
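The index arithmetic above can be wrapped in a small helper; this sketch uses 0-based NumPy slicing (the frame numbers in the text are 1-based), and the name extract_subsequence is assumed rather than taken from the paper:

import numpy as np

def extract_subsequence(X, l, w):
    # X: (T, d) time series; returns the w-step window ending l steps before the end
    T = X.shape[0]
    start = T - l - w      # 0-based equivalent of the 1-based index T - l - w + 1
    end = T - l            # slice end is exclusive, so the window ends at time T - l
    return X[start:end]

X = np.arange(1, 11).reshape(10, 1)                # toy series whose values are the frame numbers 1..10
print(extract_subsequence(X, l=2, w=5).ravel())    # [4 5 6 7 8]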

4.4 Choosing Maximum Leading Time (L)

📐 Practical Guidelines

Maximum Possible Value

L_max = T - w

This ensures you don't try to extract a subsequence that starts before the beginning of the series.

Recommended Range

Typically, L is set to capture the meaningful transition period:

  • Short sequences (T < 50): L = 3 to 5
  • Medium sequences (50 ≤ T ≤ 200): L = 5 to 10
  • Long sequences (T > 200): L = 10 to 20

Domain Knowledge Matters

For gait analysis: If a gait cycle is ~30 frames and the transition takes ~15 frames, set L ≈ 15

For equipment failure: If failure indicators appear 100 time steps before failure, set L ≈ 100

4.5 Why Not Just Use the Entire Sequence?

❌ Using Full Sequence

  • Very long input to model
  • Computational cost increases
  • Early frames may be irrelevant noise
  • Harder to learn what's important
  • Doesn't focus on transition period

✅ Using Subsequences with Leading Time

  • Fixed-length windows (easier to batch)
  • Computationally efficient
  • Focuses on relevant time period
  • Multiple training examples from one sequence
  • Captures temporal evolution

🎯 5. Model Confidence Scores

Model confidence scores are the bridge between the temporal subsequences and the synthetic sample generation. They tell us how "positive" each subsequence looks, which guides how we mix them.

5.1 What is Model Confidence?

Model Confidence Score (s⁽ˡ⁾ᵢ)
The predicted probability that subsequence X⁽ˡ⁾ᵢ belongs to the positive (minority) class, as output by a trained classifier.
Mathematical Form
s⁽ˡ⁾ᵢ = f(X⁽ˡ⁾ᵢ) ∈ [0, 1]

where:
  • f: trained classifier (LSTM, CNN, etc.)
  • s⁽ˡ⁾ᵢ: confidence score
  • Range: 0 (definitely negative) to 1 (definitely positive)

🤔 Why Do We Need Confidence Scores?

Problem: Not All Subsequences Are Equally Useful

Consider two subsequences from an ASD gait sequence:

  • X⁽⁰⁾: Last 5 frames — clearly shows ASD pattern
  • X⁽⁸⁾: Very early frames — looks completely normal

If we mix these randomly with equal weight, we might generate samples that are too normal to be useful, or too mixed to be realistic.

Solution: Use Model's Own Assessment

The trained model can tell us:

  • Which subsequences are "deep" in the positive region (high confidence)
  • Which are borderline (medium confidence)
  • Which look negative (low confidence)

We use this information to intelligently guide the mixing process.

5.2 Computing Confidence Scores

Step-by-Step Process

Train a Base Classifier

First, train an initial classifier on your imbalanced dataset (before applying T-SMOTE). This can be:

  • LSTM (for sequential dependencies)
  • 1D CNN (for local patterns)
  • Transformer (for long-range dependencies)
  • Any model that outputs probabilities

Note: This doesn't need to be perfect—it just needs to provide reasonable confidence estimates.

Generate Subsequences

For each positive sample in your dataset, create subsequences at different leading times:

X⁽⁰⁾ᵢ, X⁽¹⁾ᵢ, X⁽²⁾ᵢ, ..., X⁽ᴸ⁾ᵢ

Run Through Classifier

Pass each subsequence through the trained model to get probability outputs:

Python Example
# Assuming a trained classifier exposing a predict_proba-style interface
# (for Keras/PyTorch, read the positive-class probability from the model's output instead)
scores = {}
for l in range(L + 1):
    subseq = extract_subsequence(X_i, l, window_size)   # window ending l steps before the end
    confidence = model.predict_proba(subseq)[0, 1]      # probability of the positive class
    scores[l] = confidence

Store and Use

Store these confidence scores—you'll use them to:

  • Determine mixing weights (via Beta distribution)
  • Calculate synthetic sample confidences
  • Filter unreliable samples (via weighted sampling)

5.3 Interpreting Confidence Scores

Score Range Interpretation Position in Feature Space Usefulness for Training
0.9 - 1.0 Very high confidence positive Deep in positive region Good for establishing class center
0.7 - 0.9 Confident positive Solidly in positive region Most useful for training
0.5 - 0.7 Likely positive Approaching decision boundary Critical for learning boundaries
0.3 - 0.5 Uncertain/borderline Near or on decision boundary Handle carefully—may be mislabeled
0.0 - 0.3 Looks negative In negative region Likely mislabeled or very early stage

5.4 The Confidence Progression Pattern

📊 Typical Pattern for an ASD Gait Sequence

Sequence Length: T = 60 frames, Window: w = 20 frames

Leading Time (l) Frames Used Confidence (s⁽ˡ⁾) Description
0 [41-60] 0.94 Clear ASD pattern established
5 [36-55] 0.88 Pattern visible but less pronounced
10 [31-50] 0.76 Transitional phase beginning
15 [26-45] 0.58 Subtle abnormalities emerging
20 [21-40] 0.42 Mostly normal with hints
25 [16-35] 0.31 Appears normal

Key Observation: Confidence scores decrease as we go back in time, showing the gradual evolution from normal to ASD gait.

Why This Matters: The confidence scores encode the model's understanding of how patterns develop over time. By using these scores to guide synthesis, we ensure synthetic samples respect the natural progression of the pattern.

5.5 Edge Cases and Considerations

⚠️ Common Pitfalls

1. Poor Base Classifier

Problem: If your initial classifier is terrible (random guessing), confidence scores will be meaningless.

Solution: Ensure your base classifier achieves at least moderate performance (e.g., AUC > 0.6) before using T-SMOTE.

2. All High Confidences

Problem: If all subsequences have confidence > 0.9, you're not capturing the transition.

Solution: Increase maximum leading time L to go further back in time.

3. All Low Confidences

Problem: If all scores are < 0.5, the sample might be mislabeled.

Solution: Review labels or exclude this sample from augmentation.

4. Non-Monotonic Progression

Problem: Sometimes scores don't decrease smoothly (e.g., s⁽³⁾ > s⁽¹⁾).

Solution: This is normal due to noise. T-SMOTE is robust to small fluctuations.

💡 Pro Tip: Warm-Start Strategy

If your initial dataset is very imbalanced (e.g., 1:100), your base classifier might struggle. Try this approach:

  1. Apply simple oversampling (duplication) to get to 1:10 ratio
  2. Train a base classifier on this
  3. Use this classifier to compute T-SMOTE confidence scores
  4. Apply T-SMOTE to get to 1:1 ratio
  5. Train final classifier on T-SMOTE augmented data
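A minimal sketch of step 1 (duplicating minority samples to reach roughly a 1:10 ratio); the function name, array layout, and target ratio are illustrative:

import numpy as np

def duplicate_minority(X_min, X_maj, target_ratio=0.1, rng=np.random.default_rng(0)):
    # Repeat randomly chosen minority samples until minority ≈ target_ratio × majority
    n_needed = int(target_ratio * len(X_maj)) - len(X_min)
    if n_needed <= 0:
        return X_min
    extra = X_min[rng.integers(len(X_min), size=n_needed)]
    return np.concatenate([X_min, extra], axis=0)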

📊 6. Beta Distribution & Mixing Weights

The Beta distribution is the mathematical heart of T-SMOTE. It determines how to mix two temporal neighbors based on their confidence scores, ensuring synthetic samples are both diverse and realistic.

6.1 What is the Beta Distribution?

Beta Distribution
A continuous probability distribution defined on the interval [0, 1], parameterized by two shape parameters α and β (often denoted as a and b).
Mathematical Form
X ~ Beta(a, b)

Probability Density Function:
f(x; a, b) = x^(a-1) · (1-x)^(b-1) / B(a, b)

where B(a, b) is the Beta function (normalization constant)

Mean: E[X] = a / (a + b)
Variance: Var[X] = (a·b) / ((a+b)²(a+b+1))

6.2 Why Beta Distribution?

🎯 Perfect Properties for Our Task

1. Bounded to [0,1]

For interpolation X_new = α·X⁽ˡ⁾ + (1-α)·X⁽ˡ⁺¹⁾, we need α ∈ [0,1]. Beta naturally lives in this range.

2. Flexible Shapes

By varying parameters a and b, Beta can be:

  • Uniform: Beta(1,1) → equal probability for all α
  • Skewed toward 0: Beta(0.3, 0.8) → favors small α
  • Skewed toward 1: Beta(0.8, 0.3) → favors large α
  • Bell-shaped: Beta(5, 5) → concentrated around 0.5

3. Natural Interpretation

When a = s⁽ˡ⁾ and b = s⁽ˡ⁺¹⁾:

  • If s⁽ˡ⁾ > s⁽ˡ⁺¹⁾: α tends toward 1 → more weight on recent (confident) subsequence
  • If s⁽ˡ⁾ < s⁽ˡ⁺¹⁾: α tends toward 0 → more weight on earlier subsequence
  • If s⁽ˡ⁾ ≈ s⁽ˡ⁺¹⁾: α around 0.5 → balanced mixing

4. Incorporates Model Knowledge

By using confidence scores as parameters, we let the model's own assessment guide the augmentation process.
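A short sketch evaluating these shapes with SciPy (scipy.stats.beta); the α grid is only for illustration:

import numpy as np
from scipy.stats import beta

x = np.linspace(0.05, 0.95, 5)                 # a few α values to probe
for a, b in [(1, 1), (0.3, 0.8), (0.8, 0.3), (5, 5)]:
    # mean = a / (a + b); the pdf values show where the probability mass sits
    print(f"Beta({a},{b}): mean={beta.mean(a, b):.2f}, pdf={np.round(beta.pdf(x, a, b), 2)}")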

6.3 T-SMOTE's Use of Beta

The Formula

α ~ Beta(s⁽ˡ⁾ᵢ, s⁽ˡ⁺¹⁾ᵢ)

where:
  • s⁽ˡ⁾ᵢ: confidence of the subsequence at leading time l
  • s⁽ˡ⁺¹⁾ᵢ: confidence of the subsequence at leading time l+1
  • α: sampled mixing weight

Intuitive Interpretation

Think of a and b as "votes" for which subsequence to favor:

  • a = s⁽ˡ⁾ = 0.9: 9 votes for X⁽ˡ⁾ (recent, confident)
  • b = s⁽ˡ⁺¹⁾ = 0.4: 4 votes for X⁽ˡ⁺¹⁾ (earlier, less confident)
  • Expected α: 0.9/(0.9+0.4) ≈ 0.69 → leans toward recent one
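A quick NumPy check of this "votes" intuition, using only the assumption that the mixing weight is drawn from Beta(s⁽ˡ⁾, s⁽ˡ⁺¹⁾):

import numpy as np

rng = np.random.default_rng(0)
alphas = rng.beta(0.9, 0.4, size=100_000)  # many draws of the mixing weight
print(round(alphas.mean(), 2))             # ≈ 0.69 = 0.9 / (0.9 + 0.4)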

6.4 Visual Examples

Case 1: High vs Low Confidence

Parameters: Beta(0.90, 0.30)

  • Mean: 0.90/(0.90+0.30) = 0.75
  • Shape: mass concentrated toward large α
  • Interpretation: Sampled α values cluster toward 1 (mean 0.75), heavily favoring X⁽ˡ⁾

Distribution Shape:

Probability
    ▲
 1.5│                    ╱█
    │                   ╱ █
 1.0│                  ╱  █
    │                 ╱   █
 0.5│         ▁▁▁▁▁▁▁╱    █
    │  ▁▁▁▁▁▁▁             █
 0.0└────────────────────────────▶ α
    0   0.2  0.4  0.6  0.8  1.0
        

Result: Synthetic samples will be very similar to X⁽⁰⁾ (recent, confident)


Case 2: Similar Confidences

Parameters: Beta(0.65, 0.60)

  • Mean: 0.65/(0.65+0.60) = 0.52
  • Shape: Roughly symmetric
  • Interpretation: Balanced mixing with slight favor to X⁽ˡ⁾

Distribution Shape:

Probability
    ▲
 1.5│         ╱█╲
    │        ╱ █ ╲
 1.0│       ╱  █  ╲
    │      ╱   █   ╲
 0.5│  ▁▁▁╱    █    ╲▁▁▁
    │ ▁▁▁      █      ▁▁▁
 0.0└────────────────────────────▶ α
    0   0.2  0.4  0.6  0.8  1.0
        

Result: Diverse synthetic samples spanning both subsequences

6.5 Numerical Example

🔢 Step-by-Step Calculation

Setup:

  • Subsequence X⁽¹⁾: confidence s⁽¹⁾ = 0.84
  • Subsequence X⁽²⁾: confidence s⁽²⁾ = 0.58

Set Beta Parameters

a = s⁽¹⁾ = 0.84
b = s⁽²⁾ = 0.58

Calculate Expected Value

E[α] = a/(a+b) = 0.84/(0.84+0.58) = 0.84/1.42 ≈ 0.592

Interpretation: On average, synthetic samples will be 59.2% from X⁽¹⁾ and 40.8% from X⁽²⁾

Sample α (in practice)

Python Implementation
import numpy as np

# Sample a mixing weight from Beta(s⁽¹⁾, s⁽²⁾)
alpha = np.random.beta(0.84, 0.58)
# Example output: alpha = 0.627
print(f"Sampled α: {alpha:.3f}")
# This specific sample: 62.7% from X⁽¹⁾, 37.3% from X⁽²⁾

Create Synthetic Sample

X_new = 0.627 × X⁽¹⁾ + 0.373 × X⁽²⁾

This preserves temporal structure while creating a slightly earlier version of the pattern.

6.6 Why Not Simpler Alternatives?

❌ Uniform Random (α ~ U[0,1])

  • Ignores confidence information
  • Treats all mixing equally likely
  • Could create unrealistic samples
  • No model guidance

Example: Might mix 90% confident with 30% confident subsequence using α=0.1, creating mostly negative-looking sample

✅ Beta Distribution

  • Incorporates confidence
  • Adaptively weights mixing
  • Creates realistic samples
  • Model-guided augmentation

Same scenario: Beta(0.9, 0.3) naturally produces α around 0.7-0.8, keeping samples positive

❌ Fixed α (e.g., α=0.5)

  • No diversity
  • All synthetics are identical
  • Overfitting risk
  • Doesn't explore space

✅ Sampled α from Beta

  • Natural diversity
  • Different synthetics each time
  • Better generalization
  • Explores around mean
The Brilliance of Beta: It automatically adjusts the mixing strategy based on how confident the model is about each subsequence. High confidence differences → skewed mixing (favor confident one). Similar confidences → balanced mixing (explore between them).

⚗️ 7. Synthesizing New Samples

Now we bring everything together: temporal subsequences, confidence scores, and Beta-sampled mixing weights combine to create synthetic time-series samples that are both temporally coherent and strategically positioned in feature space.

7.1 The Core Synthesis Formula

X_new = α · X⁽ˡ⁾ᵢ + (1-α) · X⁽ˡ⁺¹⁾ᵢ

where:
  • X⁽ˡ⁾ᵢ, X⁽ˡ⁺¹⁾ᵢ: two consecutive temporal subsequences
  • α ~ Beta(s⁽ˡ⁾ᵢ, s⁽ˡ⁺¹⁾ᵢ): mixing weight
  • X_new: synthetic subsequence (w × d matrix)

This operation is performed element-wise on all time steps and features.

📐 Element-Wise Operation

The interpolation happens for every single value in the matrices:

For each time step t ∈ [1, w] and feature f ∈ [1, d]:

X_new[t, f] = α · X⁽ˡ⁾[t, f] + (1-α) · X⁽ˡ⁺¹⁾[t, f]

This ensures temporal coherence—we're not mixing different time steps!
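In NumPy this element-wise mix is a single vectorized expression over two (w × d) arrays; the shapes below are illustrative:

import numpy as np

rng = np.random.default_rng(0)
X_l, X_l1 = rng.random((2, 5, 12))          # two subsequences: window w=5, d=12 features
alpha = rng.beta(0.84, 0.58)                # mixing weight from the two confidence scores
X_new = alpha * X_l + (1 - alpha) * X_l1    # the same α applied to every time step and feature
print(X_new.shape)                          # (5, 12)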

7.2 Complete Example with Real Numbers

🔢 Full Worked Example

Scenario: 3-feature gait data, window size w=4

Step 1: Two Temporal Subsequences

X⁽⁰⁾ (recent, l=0, confidence = 0.84):

Time  hip_x  knee_x  ankle_x
1     1.6    0.9     2.5
2     1.8    1.0     2.7
3     2.0    1.1     2.9
4     2.2    1.3     3.0

X⁽¹⁾ (earlier, l=1, confidence = 0.58):

Time  hip_x  knee_x  ankle_x
1     1.3    0.8     2.3
2     1.6    0.9     2.5
3     1.8    1.0     2.7
4     2.0    1.1     2.9

Step 2: Sample Mixing Weight

α ~ Beta(0.84, 0.58)

Let's say we sample: α = 0.59
Therefore: (1-α) = 0.41

Step 3: Compute Synthetic Sample (Element-by-Element)

Time step 1:

hip_x: 0.59×1.6 + 0.41×1.3 = 0.944 + 0.533 = 1.477

knee_x: 0.59×0.9 + 0.41×0.8 = 0.531 + 0.328 = 0.859

ankle_x: 0.59×2.5 + 0.41×2.3 = 1.475 + 0.943 = 2.418

Time step 2:

hip_x: 0.59×1.8 + 0.41×1.6 = 1.062 + 0.656 = 1.718

knee_x: 0.59×1.0 + 0.41×0.9 = 0.590 + 0.369 = 0.959

ankle_x: 0.59×2.7 + 0.41×2.5 = 1.593 + 1.025 = 2.618

Time step 3:

hip_x: 0.59×2.0 + 0.41×1.8 = 1.180 + 0.738 = 1.918

knee_x: 0.59×1.1 + 0.41×1.0 = 0.649 + 0.410 = 1.059

ankle_x: 0.59×2.9 + 0.41×2.7 = 1.711 + 1.107 = 2.818

Time step 4:

hip_x: 0.59×2.2 + 0.41×2.0 = 1.298 + 0.820 = 2.118

knee_x: 0.59×1.3 + 0.41×1.1 = 0.767 + 0.451 = 1.218

ankle_x: 0.59×3.0 + 0.41×2.9 = 1.770 + 1.189 = 2.959

Step 4: Final Synthetic Sample (X_new)

Time  hip_x  knee_x  ankle_x
1     1.477  0.859   2.418
2     1.718  0.959   2.618
3     1.918  1.059   2.818
4     2.118  1.218   2.959

✅ Verification:

  • All values lie between X⁽⁰⁾ and X⁽¹⁾ ✓
  • Temporal progression is smooth ✓
  • Closer to X⁽⁰⁾ (since α=0.59) ✓
  • Physically plausible joint positions ✓
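The worked example can be reproduced in a few lines of NumPy (values copied from the tables; α fixed at 0.59 rather than sampled):

import numpy as np

X0 = np.array([[1.6, 0.9, 2.5],
               [1.8, 1.0, 2.7],
               [2.0, 1.1, 2.9],
               [2.2, 1.3, 3.0]])   # X⁽⁰⁾, confidence 0.84
X1 = np.array([[1.3, 0.8, 2.3],
               [1.6, 0.9, 2.5],
               [1.8, 1.0, 2.7],
               [2.0, 1.1, 2.9]])   # X⁽¹⁾, confidence 0.58
alpha = 0.59
X_new = alpha * X0 + (1 - alpha) * X1
print(np.round(X_new, 3))          # first row: [1.477 0.859 2.418], matching the table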

7.3 Synthetic Label Confidence

Along with the synthetic sequence, we also compute its expected confidence:

s_new = α · s⁽ˡ⁾ᵢ + (1-α) · s⁽ˡ⁺¹⁾ᵢ

Continuing Our Example:

s_new = 0.59 × 0.84 + 0.41 × 0.58 = 0.4956 + 0.2378 = 0.7334

Interpretation: The synthetic sample is expected to have ~73% confidence of being ASD—still positive, but closer to the decision boundary than X⁽⁰⁾.
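A two-line check of this label-confidence mix (plain Python, with α fixed at 0.59 as above):

alpha = 0.59
s_new = alpha * 0.84 + (1 - alpha) * 0.58   # mix the two confidence scores with the same α
print(round(s_new, 4))                      # 0.7334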

7.4 Why This Works: The Geometry

Geometric Interpretation in Feature Space

Imagine plotting confidence scores along a temporal axis:

Confidence
    1.0│                         ● X⁽⁰⁾ (0.84)
       │                        ╱
    0.8│                    ★  ← X_new (0.73)
       │                   ╱
    0.6│              ● X⁽¹⁾ (0.58)
       │             ╱
    0.4│        ● X⁽²⁾
       │       ╱
    0.2│  ● X⁽³⁾
       │
    0.0└─────────────────────────────────▶ Time
       earlier                       recent
    

Key Points:

  • Synthetic sample (★) falls between its two temporal neighbors, X⁽⁰⁾ and X⁽¹⁾, in both time and confidence