Time Series Segmentation & Embedding — Notes
Clear, step-by-step notes that combine concepts, math, examples, and a pipeline diagram. Designed to be read on screen or printed. Reference: Irani et al., "Time Series Embedding Methods for Classification Tasks: A Review" (Expert Systems, 2025).
1. What this document covers
- Definitions and intuition: segmentation, embedding, aggregation
- The end-to-end pipeline: math + flow
- Worked numeric example
- Common embedding methods and their interpretations
- Tips: hyperparameters, pitfalls, evaluation
2. Time Series Segmentation — Intuition
Segmentation is the process of splitting a long time series into multiple fixed-length windows (segments) that serve as individual training samples. Each segment is processed independently by the embedding function and the classifier. Segmentation helps the model learn local temporal structure rather than trying to model very long sequences directly.
Key hyperparameters
- Window size (τ): number of time steps in each segment.
- Overlap (ω): how many time steps the next window shares with the previous window.
- Stride: τ − ω, the number of time steps between the starts of consecutive windows (equivalently, how many new time steps each window contributes).
Why we segment
- Transforms variable-length series into many fixed-size examples.
- Boosts training set size by creating multiple overlapping views.
- Allows local patterns to be learned more efficiently (e.g., a heartbeat morphology in a fixed window).
Visual idea (conceptual): windows of length τ slide along the series, each new window sharing ω time steps with the previous one, so one long recording becomes a stack of overlapping fixed-length segments; the sketch below makes this concrete in code.
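A minimal Python/NumPy sketch of this windowing step (the function name segment and the choice to trim the trailing partial window are illustrative, not a fixed API):

import numpy as np

def segment(x, tau, omega):
    # Split a 1-D series x into windows of length tau that overlap by omega steps.
    # Windows that would run past the end of the series are dropped (trimming);
    # padding is the other common choice.
    stride = tau - omega
    starts = range(0, len(x) - tau + 1, stride)
    return np.stack([x[s:s + tau] for s in starts])

# tau=5, omega=2 -> stride 3, matching the worked example later in these notes.
x = np.array([2, 3, 5, 6, 7, 6, 5, 4, 3, 2])
print(segment(x, tau=5, omega=2))   # [[2 3 5 6 7] [6 7 6 5 4]]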
3. Segment Labels: Where they come from and why aggregation?
When the original dataset has a label at each timestep (e.g., activity at every second), a single segment contains many such labels. To produce a single target value for the whole segment we use an aggregation function.
Common aggregation choices
- Mode (most common label) — used for categorical labels (the paper uses this).
- Majority threshold — choose a class only if it appears in more than a threshold fraction (e.g., >50%); see the sketch at the end of this section.
- Proportional labels — for multi-label problems, store label distribution inside the window.
Why mode is sensible
- Simple, robust for categorical labels.
- Avoids nonsensical averaging of categorical values (the mean of two class IDs is not a class).
- Represents the "dominant" state in the time window.
Where aggregated labels are used
The aggregated label y_{i,j} is the target used during classification training. For each segment:
segment: s_{i,j} -> embedding: v_{i,j} -> classifier input
label: y_{i,j} -> used as the target to compute the loss during training
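A small sketch of the mode and majority-threshold choices above, using Python's collections.Counter (the function name aggregate_label and its threshold argument are illustrative, not from the paper):

from collections import Counter

def aggregate_label(window_labels, threshold=None):
    # Mode aggregation: the most frequent label in the window becomes the segment label.
    # If a threshold (e.g. 0.5) is given, return None when no class exceeds that
    # fraction, so ambiguous windows can be rejected.
    label, count = Counter(window_labels).most_common(1)[0]
    if threshold is not None and count / len(window_labels) <= threshold:
        return None
    return label

print(aggregate_label([0, 0, 0, 0, 1]))              # -> 0
print(aggregate_label([1, 1, 0, 0], threshold=0.5))  # -> None (no clear majority)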
4. Normalization & Preprocessing
Normalization ensures that channels and segments are comparable. Two common approaches:
- Standardization: subtract channel mean and divide by channel standard deviation (computed on the training set)
- Min-max scaling: rescale to [0,1] using training min and max
Notation — for a channel c in the training set:
standardization: s̃[t,c] = (s[t,c] − μ_c) / σ_c
min-max: s̃[t,c] = (s[t,c] − min_c) / (max_c − min_c)
Always compute statistics only on training segments and apply the same transformation to validation/test segments.
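A minimal NumPy sketch of train-only standardization, assuming segments are stored as an array of shape (n_segments, τ, C); the function names are illustrative:

import numpy as np

def fit_channel_stats(train_segments):
    # One mean and one std per channel, computed over all training segments and time steps.
    mu = train_segments.mean(axis=(0, 1))
    sigma = train_segments.std(axis=(0, 1))
    return mu, sigma

def standardize(segments, mu, sigma):
    # Apply the training-set statistics; a small epsilon guards against sigma == 0.
    return (segments - mu) / (sigma + 1e-8)

# Usage: fit on train, reuse the same mu/sigma for validation and test.
# mu, sigma = fit_channel_stats(train_segments)
# train_std = standardize(train_segments, mu, sigma)
# test_std  = standardize(test_segments, mu, sigma)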
5. What is an Embedding?
An embedding is a fixed-length vector that summarizes a segment. Formally, for a segment s (shape τ × C):
v = g(s) where v ∈ ℝᵈ
Here ℝᵈ simply means a vector with d real numbers (e.g., d=64). The embedding function g(·) can be:
- Hand-crafted: mean, std, spectral power, peak counts
- Transformation-based: DFT/FFT, wavelet coefficients
- Statistical: PCA projection
- Model-based / Learned: autoencoders, CNN/RNN/Transformer encoders
- Topological/Graph: features from visibility graphs or persistence diagrams
Why embeddings help
- Convert variable-length or long inputs into a compact, fixed-size form.
- Make it easy to use classical classifiers (SVM, Random Forest, Logistic Regression).
- Enable similarity search and visualization in embedding space.
Examples of simple g(·)
g_meanvar(s) = [ mean(s), var(s) ] // a 2-d embedding
g_fft(s) = [ |FFT_k| for k in selected frequencies ]
encoderNN(s) = last_hidden_state_from_cnn_or_transformer // learned d-dim vector
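The first two of these as runnable NumPy sketches (the function names and the number of retained frequencies k are illustrative choices):

import numpy as np

def g_meanvar(s):
    # 2-d embedding: mean and variance of the window.
    return np.array([s.mean(), s.var()])

def g_fft(s, k=8):
    # Magnitudes of the first k non-zero FFT frequencies (rfft, since the input is real).
    return np.abs(np.fft.rfft(s))[1:k + 1]

s = np.array([2.0, 3.0, 5.0, 6.0, 7.0])
print(g_meanvar(s))   # [4.6  3.44]
print(g_fft(s, k=2))  # low-frequency spectral magnitudes of the window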
6. Classifiers & Training
Once you have embeddings v_{i,j} and labels y_{i,j}, you train a classifier f_θ to map embeddings → labels:
ŷ = f_θ(v) where ŷ is the predicted label (or probability)
Common classifier choices
- Logistic Regression / Linear models
- Random Forest / XGBoost
- KNN / SVM
- MLP (dense neural network)
Loss & optimization (high level)
- For binary classification: binary cross-entropy
- For multi-class: categorical cross-entropy (softmax)
- Optimization via gradient descent (SGD, Adam) for neural nets; tree ensembles use different fitting algorithms
Where segment labels are used
The aggregated segment label y_{i,j} is used directly in the loss computation during training. The model updates its parameters θ to reduce the discrepancy between ŷ and y.
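A toy scikit-learn sketch of this step; the embedding values and labels below are made up purely for illustration, and any of the classifiers listed above could be substituted:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

V = np.array([[4.6, 1.85], [5.6, 1.02], [3.5, 1.12], [5.0, 0.90]])   # segment embeddings v_{i,j}
y = np.array([0, 1, 0, 1])                                           # aggregated segment labels y_{i,j}

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(V, y)                # the segment labels drive the fitting here
print(clf.predict(V[:1]))    # predicted label for the first segment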
7. Formal pipeline — mathematics
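In symbols, the steps from Sections 2–6 chain together as follows (restated here in the notation already used in these notes):
- Segmentation: from a series X_i with per-step labels Y_i, form windows s_{i,j} = X_i[t_j : t_j + τ] with starts t_{j+1} = t_j + (τ − ω).
- Label aggregation: y_{i,j} = mode(Y_i[t_j : t_j + τ]).
- Normalization: s̃_{i,j}[t,c] = (s_{i,j}[t,c] − μ_c) / σ_c, with μ_c, σ_c computed on training segments only.
- Embedding: v_{i,j} = g(s̃_{i,j}), with v_{i,j} ∈ ℝᵈ.
- Classification: ŷ_{i,j} = f_θ(v_{i,j}).
- Training: choose θ to minimize Σ_{i,j} L(ŷ_{i,j}, y_{i,j}), where L is a cross-entropy loss for classification.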
8. Worked numerical example (step-by-step)
We use a small example to see the pipeline in numbers. Let:
X = [2,3,5,6,7,6,5,4,3,2]
Y = [0,0,0,0,1,1,1,1,0,0]
τ = 5, ω = 2 (so stride = 3)
Segments
- s1 = [2,3,5,6,7], labels [0,0,0,0,1] → mode = 0
- s2 = [6,7,6,5,4], labels [0,1,1,1,1] → mode = 1
- s3 = [5,4,3,2,?] — only four time steps remain, so this window is trimmed (dropped) or padded; its labels [1,1,0,0,?] are aggregated the same way (here the mode is tied, so a tie-break or rejection rule is needed).
Embedding (simple mean-std)
g(s) = [ mean(s), std(s) ]
For s1: mean=4.6, std≈1.85 → v1=[4.6,1.85]
For s2: mean=5.6, std≈1.02 → v2=[5.6,1.02]
Train a simple classifier
Using logistic regression: ŷ = σ(w·v + b)
Use segments (v1,y1), (v2,y2), ... to fit weights w and bias b via gradient descent.
Note: In real datasets you will have many more overlapping segments and a richer embedding (d>2).
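A minimal NumPy/scikit-learn sketch that reproduces these numbers (trimming the trailing partial window):

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([2, 3, 5, 6, 7, 6, 5, 4, 3, 2], dtype=float)
Y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0])
tau, omega = 5, 2
stride = tau - omega

segments, labels = [], []
for start in range(0, len(X) - tau + 1, stride):
    segments.append(X[start:start + tau])
    labels.append(int(np.bincount(Y[start:start + tau]).argmax()))   # mode of the window labels

V = np.array([[s.mean(), s.std()] for s in segments])   # mean-std embedding
print(V)        # [[4.6 1.854...] [5.6 1.019...]]
print(labels)   # [0, 1]

clf = LogisticRegression().fit(V, labels)   # fits w and b of sigma(w·v + b)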
9. Practical tips and common pitfalls
Hyperparameters
- τ (window size): pick based on the temporal scale of the phenomenon (e.g., one gait cycle, one sleep epoch).
- ω (overlap): higher overlap increases data size and context but also correlation between samples — be mindful of train/val splits.
- Embedding dimension d: tuned per method; too small loses information, too large invites overfitting.
Data leakage risks
- Ensure splits are made at the entity level (subject, device, date) so that windows from the same entity do not appear in both train and test (see the sketch after this list).
- When computing normalization stats, compute them only on training data.
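A short scikit-learn sketch of an entity-level split with GroupShuffleSplit; the embeddings, labels, and subject IDs below are hypothetical:

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

V = np.random.rand(8, 2)                      # hypothetical segment embeddings
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])        # hypothetical segment labels
groups = np.array([1, 1, 1, 2, 2, 3, 3, 3])   # subject ID for each segment

splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(V, y, groups=groups))
print(set(groups[train_idx]) & set(groups[test_idx]))   # empty set: no subject appears in both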
Label noise & mixed windows
- Short windows reduce the chance of mixed labels; long windows increase label ambiguity.
- Consider rejecting windows that do not have a clear majority label or using soft targets that reflect label distribution.
Evaluation
- Report per-segment metrics and, where relevant, aggregated sequence metrics (e.g., majority vote across windows to predict a full sequence; see the sketch below).
- Use balanced metrics (F1, balanced accuracy) when classes are imbalanced.
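A tiny sketch of sequence-level aggregation by majority vote over window predictions (illustrative helper, not a library function):

import numpy as np

def sequence_prediction(window_preds):
    # Majority vote across the windows of one sequence; ties go to the lower class ID.
    return int(np.bincount(np.asarray(window_preds)).argmax())

print(sequence_prediction([1, 1, 0, 1, 0]))   # -> 1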
10. References
Main reference used to build these notes:
- Irani, H., Ghahremani, Y., Kermani, A., & Metsis, V. (2025). Time Series Embedding Methods for Classification Tasks: A Review. Expert Systems. DOI: 10.1111/exsy.70148.
Further reading (selected): PCA, FFT/Wavelets textbooks, TSFRESH/catch22 docs, and contrastive learning literature (e.g., NNCLR).