Time Series Segmentation & Embedding — Notes
Clear, step-by-step notes that combine concepts, math, examples, and a pipeline diagram. Designed to be read on screen or printed. Reference: Irani et al., "Time Series Embedding Methods for Classification Tasks: A Review" (Expert Systems, 2025).
1. What this document covers
- Definitions and intuition: segmentation, embedding, aggregation
- The end-to-end pipeline: math + flow
- Worked numeric example
- Common embedding methods and their interpretations
- Tips: hyperparameters, pitfalls, evaluation
2. Time Series Segmentation — Intuition
Segmentation is the process of splitting a long time series into multiple fixed-length windows (segments) that serve as individual training samples. Each segment is processed independently by the embedding function and the classifier. Segmentation helps the model learn local temporal structure rather than trying to model very long sequences directly.
Key hyperparameters
- Window size (τ): number of time steps in each segment.
- Overlap (ω): how many time steps the next window shares with the previous window.
- Stride: τ − ω, the number of time steps between the starts of consecutive windows (equivalently, how many new time steps each window contributes).
Why we segment
- Transforms variable-length series into many fixed-size examples.
- Boosts training set size by creating multiple overlapping views.
- Allows local patterns to be learned more efficiently (e.g., a heartbeat morphology in a fixed window).
Visual idea (conceptual): windows of length τ slide along the series, each new window sharing ω time steps with the previous one, so one long recording becomes a stack of overlapping fixed-length segments; the sketch below makes this concrete in code.
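A minimal Python/NumPy sketch of this windowing step (the function name segment and the choice to trim the trailing partial window are illustrative, not a fixed API):

import numpy as np

def segment(x, tau, omega):
    # Split a 1-D series x into windows of length tau that overlap by omega steps.
    # Windows that would run past the end of the series are dropped (trimming);
    # padding is the other common choice.
    stride = tau - omega
    starts = range(0, len(x) - tau + 1, stride)
    return np.stack([x[s:s + tau] for s in starts])

# tau=5, omega=2 -> stride 3, matching the worked example later in these notes.
x = np.array([2, 3, 5, 6, 7, 6, 5, 4, 3, 2])
print(segment(x, tau=5, omega=2))   # [[2 3 5 6 7] [6 7 6 5 4]]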
3. Segment Labels: Where they come from and why aggregation?
When the original dataset has a label at each timestep (e.g., activity at every second), a single segment contains many such labels. To produce a single target value for the whole segment we use an aggregation function.
Common aggregation choices
- Mode (most common label) — used for categorical labels (the paper uses this).
- Majority threshold — choose a class only if it appears in more than a threshold fraction (e.g., >50%); see the sketch at the end of this section.
- Proportional labels — for multi-label problems, store label distribution inside the window.
Why mode is sensible
- Simple, robust for categorical labels.
- Avoids nonsensical averaging of categorical values (the mean of two class IDs is not a class).
- Represents the "dominant" state in the time window.
Where aggregated labels are used
The aggregated label y_{i,j} is the target used during classification training. For each segment:
segment: s_{i,j} -> embedding: v_{i,j} -> classifier input
label: y_{i,j} -> used as the target to compute the loss during training
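A small sketch of the mode and majority-threshold choices above, using Python's collections.Counter (the function name aggregate_label and its threshold argument are illustrative, not from the paper):

from collections import Counter

def aggregate_label(window_labels, threshold=None):
    # Mode aggregation: the most frequent label in the window becomes the segment label.
    # If a threshold (e.g. 0.5) is given, return None when no class exceeds that
    # fraction, so ambiguous windows can be rejected.
    label, count = Counter(window_labels).most_common(1)[0]
    if threshold is not None and count / len(window_labels) <= threshold:
        return None
    return label

print(aggregate_label([0, 0, 0, 0, 1]))              # -> 0
print(aggregate_label([1, 1, 0, 0], threshold=0.5))  # -> None (no clear majority)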
4. Normalization & Preprocessing
Normalization ensures that channels and segments are comparable. Two common approaches:
- Standardization: subtract channel mean and divide by channel standard deviation (computed on the training set)
- Min-max scaling: rescale to [0,1] using training min and max
Notation — for a channel c in the training set:
standardization: s̃[t,c] = (s[t,c] − μ_c) / σ_c
min-max: s̃[t,c] = (s[t,c] − min_c) / (max_c − min_c)
Always compute statistics only on training segments and apply the same transformation to validation/test segments.
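A minimal NumPy sketch of train-only standardization, assuming segments are stored as an array of shape (n_segments, τ, C); the function names are illustrative:

import numpy as np

def fit_channel_stats(train_segments):
    # One mean and one std per channel, computed over all training segments and time steps.
    mu = train_segments.mean(axis=(0, 1))
    sigma = train_segments.std(axis=(0, 1))
    return mu, sigma

def standardize(segments, mu, sigma):
    # Apply the training-set statistics; a small epsilon guards against sigma == 0.
    return (segments - mu) / (sigma + 1e-8)

# Usage: fit on train, reuse the same mu/sigma for validation and test.
# mu, sigma = fit_channel_stats(train_segments)
# train_std = standardize(train_segments, mu, sigma)
# test_std  = standardize(test_segments, mu, sigma)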
5. What is an Embedding?
An embedding is a fixed-length vector that summarizes a segment. Formally, for a segment s (shape τ × C):
v = g(s) where v ∈ ℝᵈ
Here ℝᵈ simply means a vector with d real numbers (e.g., d=64). The embedding function g(·) can be:
- Hand-crafted: mean, std, spectral power, peak counts
- Transformation-based: DFT/FFT, wavelet coefficients
- Statistical: PCA projection
- Model-based / Learned: autoencoders, CNN/RNN/Transformer encoders
- Topological/Graph: features from visibility graphs or persistence diagrams
Why embeddings help
- Convert variable-length or long inputs into a compact, fixed-size form.
- Make it easy to use classical classifiers (SVM, Random Forest, Logistic Regression).
- Enable similarity search and visualization in embedding space.
Examples of simple g(·)
g_meanvar(s) = [ mean(s), var(s) ] // a 2-d embedding
g_fft(s) = [ |FFT_k| for k in selected frequencies ]
encoderNN(s) = last_hidden_state_from_cnn_or_transformer // learned d-dim vector
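The first two of these as runnable NumPy sketches (the function names and the number of retained frequencies k are illustrative choices):

import numpy as np

def g_meanvar(s):
    # 2-d embedding: mean and variance of the window.
    return np.array([s.mean(), s.var()])

def g_fft(s, k=8):
    # Magnitudes of the first k non-zero FFT frequencies (rfft, since the input is real).
    return np.abs(np.fft.rfft(s))[1:k + 1]

s = np.array([2.0, 3.0, 5.0, 6.0, 7.0])
print(g_meanvar(s))   # [4.6  3.44]
print(g_fft(s, k=2))  # low-frequency spectral magnitudes of the window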
6. Classifiers & Training
Once you have embeddings v_{i,j} and labels y_{i,j}, you train a classifier f_θ to map embeddings → labels:
ŷ = f_θ(v) where ŷ is the predicted label (or probability)
Common classifier choices
- Logistic Regression / Linear models
- Random Forest / XGBoost
- KNN / SVM
- MLP (dense neural network)
Loss & optimization (high level)
- For binary classification: binary cross-entropy
- For multi-class: categorical cross-entropy (softmax)
- Optimization via gradient descent (SGD, Adam) for neural nets; tree ensembles use different fitting algorithms
Where segment labels are used
The aggregated segment label y_{i,j} is used directly in the loss computation during training. The model updates its parameters θ to reduce the discrepancy between ŷ and y.
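A toy scikit-learn sketch of this step; the embedding values and labels below are made up purely for illustration, and any of the classifiers listed above could be substituted:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

V = np.array([[4.6, 1.85], [5.6, 1.02], [3.5, 1.12], [5.0, 0.90]])   # segment embeddings v_{i,j}
y = np.array([0, 1, 0, 1])                                           # aggregated segment labels y_{i,j}

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(V, y)                # the segment labels drive the fitting here
print(clf.predict(V[:1]))    # predicted label for the first segment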
7. Formal pipeline — mathematics
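In symbols, the steps from Sections 2–6 chain together as follows (restated here in the notation already used in these notes):
- Segmentation: from a series X_i with per-step labels Y_i, form windows s_{i,j} = X_i[t_j : t_j + τ] with starts t_{j+1} = t_j + (τ − ω).
- Label aggregation: y_{i,j} = mode(Y_i[t_j : t_j + τ]).
- Normalization: s̃_{i,j}[t,c] = (s_{i,j}[t,c] − μ_c) / σ_c, with μ_c, σ_c computed on training segments only.
- Embedding: v_{i,j} = g(s̃_{i,j}), with v_{i,j} ∈ ℝᵈ.
- Classification: ŷ_{i,j} = f_θ(v_{i,j}).
- Training: choose θ to minimize Σ_{i,j} L(ŷ_{i,j}, y_{i,j}), where L is a cross-entropy loss for classification.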
8. Worked numerical example (step-by-step)
We use a small example to see the pipeline in numbers. Let:
X = [2,3,5,6,7,6,5,4,3,2]
Y = [0,0,0,0,1,1,1,1,0,0]
τ = 5, ω = 2 (so stride = 3)
Segments
- s1 = [2,3,5,6,7], labels [0,0,0,0,1] → mode = 0
- s2 = [6,7,6,5,4], labels [0,1,1,1,1] → mode = 1
- s3 = [5,4,3,2,?] — only four time steps remain, so this window is trimmed (dropped) or padded; its labels [1,1,0,0,?] are aggregated the same way (here the mode is tied, so a tie-break or rejection rule is needed).
Embedding (simple mean-std)
g(s) = [ mean(s), std(s) ]
For s1: mean=4.6, std≈1.85 → v1=[4.6,1.85]
For s2: mean=5.6, std≈1.02 → v2=[5.6,1.02]
Train a simple classifier
Using logistic regression: ŷ = σ(w·v + b)
Use segments (v1,y1), (v2,y2), ... to fit weights w and bias b via gradient descent.
Note: In real datasets you will have many more overlapping segments and a richer embedding (d>2).
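A minimal NumPy/scikit-learn sketch that reproduces these numbers (trimming the trailing partial window):

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([2, 3, 5, 6, 7, 6, 5, 4, 3, 2], dtype=float)
Y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0])
tau, omega = 5, 2
stride = tau - omega

segments, labels = [], []
for start in range(0, len(X) - tau + 1, stride):
    segments.append(X[start:start + tau])
    labels.append(int(np.bincount(Y[start:start + tau]).argmax()))   # mode of the window labels

V = np.array([[s.mean(), s.std()] for s in segments])   # mean-std embedding
print(V)        # [[4.6 1.854...] [5.6 1.019...]]
print(labels)   # [0, 1]

clf = LogisticRegression().fit(V, labels)   # fits w and b of sigma(w·v + b)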
9. Practical tips and common pitfalls
Hyperparameters
- τ (window size): pick based on the temporal scale of the phenomenon (e.g., one gait cycle, one sleep epoch).
- ω (overlap): higher overlap increases data size and context but also correlation between samples — be mindful of train/val splits.
- Embedding dimension d: tuned per method; too small loses information, too large invites overfitting.
Data leakage risks
- Ensure splits are made at the entity level (subject, device, date) so that windows from the same entity do not appear in both train and test (see the sketch after this list).
- When computing normalization stats, compute them only on training data.
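A short scikit-learn sketch of an entity-level split with GroupShuffleSplit; the embeddings, labels, and subject IDs below are hypothetical:

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

V = np.random.rand(8, 2)                      # hypothetical segment embeddings
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])        # hypothetical segment labels
groups = np.array([1, 1, 1, 2, 2, 3, 3, 3])   # subject ID for each segment

splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(V, y, groups=groups))
print(set(groups[train_idx]) & set(groups[test_idx]))   # empty set: no subject appears in both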
Label noise & mixed windows
- Short windows reduce the chance of mixed labels; long windows increase label ambiguity.
- Consider rejecting windows that do not have a clear majority label or using soft targets that reflect label distribution.
Evaluation
- Report per-segment metrics and, where relevant, aggregated sequence metrics (e.g., majority vote across windows to predict a full sequence; see the sketch below).
- Use balanced metrics (F1, balanced accuracy) when classes are imbalanced.
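A tiny sketch of sequence-level aggregation by majority vote over window predictions (illustrative helper, not a library function):

import numpy as np

def sequence_prediction(window_preds):
    # Majority vote across the windows of one sequence; ties go to the lower class ID.
    return int(np.bincount(np.asarray(window_preds)).argmax())

print(sequence_prediction([1, 1, 0, 1, 0]))   # -> 1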
10. References
Main reference used to build these notes:
- Irani, H., Ghahremani, Y., Kermani, A., & Metsis, V. (2025). Time Series Embedding Methods for Classification Tasks: A Review. Expert Systems. DOI: 10.1111/exsy.70148.
Further reading (selected): PCA, FFT/Wavelets textbooks, TSFRESH/catch22 docs, and contrastive learning literature (e.g., NNCLR).