Research Paper

DisentangledTMR

Privacy-Preserving Skeleton Motion Retargeting via Explicit Architectural Disentanglement

3 Training Stages · 6+ Disentanglement Losses · 6 Physical Metrics · 8 Ablation Studies

Project Overview

DisentangledTMR is a privacy-preserving skeleton motion retargeting system that uses explicit architectural disentanglement to separate action dynamics from skeletal identity.

The Privacy Problem
Why skeleton data needs protection

The Threat

Skeleton motion data, despite lacking visual appearance, contains rich biometric signatures that enable person re-identification with >80% accuracy.

  • 📏 Static cues: Bone lengths, limb ratios, body proportions
  • 🏃 Dynamic cues: Gait patterns, movement style, posture habits
  • 🔗 Linkage attacks: Cross-session tracking without labels

Our Solution

Transfer motion dynamics to a target skeleton while provably removing identity information through explicit architectural disentanglement.

Key Insight: By architecturally separating action and identity encoders, we ensure the action representation cannot leak identity information by construction.
Disentanglement Criteria
Four properties that define successful disentanglement
Action Retention

$\mathbf{H}_{\text{action}}$ must preserve action class information

Identity Removal

$\mathbf{H}_{\text{action}}$ must not leak source identity

Identity Capture

$\mathbf{H}_{\text{identity}}$ must encode target skeleton structure

Statistical Independence

$\mathbf{H}_{\text{action}} \perp \mathbf{H}_{\text{identity}}$

System Pipeline
End-to-end motion retargeting flow
flowchart LR
    subgraph Input
        S1[Source Motion<br>Person A]
        S2[Target Skeleton<br>Person B]
    end

    subgraph Encoders
        AE[Action Encoder<br>Temporal Conv + LSTM + Attn]
        IE[Identity Encoder<br>Spatial GCN + Attn]
    end

    subgraph Decoder
        FD[Factorized Decoder<br>Cross-Attention Fusion]
    end

    subgraph Output
        OUT[Retargeted Motion<br>Action A on Body B]
    end

    S1 --> AE
    S2 --> IE
    AE --> FD
    IE --> FD
    FD --> OUT

    style AE fill:#6366f1,stroke:#6366f1,color:#fff
    style IE fill:#10b981,stroke:#10b981,color:#fff
    style FD fill:#f59e0b,stroke:#f59e0b,color:#000

src/model/disentangled_tmr.py → DisentangledTMR

Datasets
Evaluation benchmarks for skeleton-based action recognition
| Dataset | Subjects | Actions | Samples | Joints | Frames | Protocol |
|---|---|---|---|---|---|---|
| NTU RGB+D 60 | 40 | 60 | 56,880 | 25 | 64 | Cross-Subject / Cross-View |
| NTU RGB+D 120 | 106 | 120 | 114,480 | 25 | 64 | Cross-Subject / Cross-Setup |
| ETRI-Activity3D | 100 | 55 | 112,620 | 25 | 64 | Cross-Subject |
All datasets use the same 25-joint skeleton topology (Kinect v2 format) with sequences normalized to T=64 frames via linear interpolation.
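As a concrete illustration, a minimal sketch of the T=64 linear-interpolation resampling; the function name and the (T, V, 3) tensor layout are assumptions, not the project's API:

import torch
import torch.nn.functional as F

def resample_sequence(x: torch.Tensor, t_out: int = 64) -> torch.Tensor:
    """Linearly resample a skeleton sequence to a fixed length.

    x: (T, V, 3) joint positions (V=25 for Kinect v2).
    Returns: (t_out, V, 3).
    """
    t, v, d = x.shape
    flat = x.reshape(t, v * d).T.unsqueeze(0)      # (1, V*3, T) for interpolate
    out = F.interpolate(flat, size=t_out, mode="linear", align_corners=True)
    return out.squeeze(0).T.reshape(t_out, v, d)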

Model Architecture

Three specialized components work together: Action Encoder extracts temporal dynamics, Identity Encoder captures skeletal structure, and Factorized Decoder fuses them for retargeting.

Action Path
Action Encoder
Extracts identity-invariant temporal dynamics from source motion
flowchart LR
    subgraph Input
        IN[Velocity + Acceleration]
    end

    subgraph CoreLayers
        CONV[Temporal Convs]
        ATT[Multi-Head Attention]
        LSTM[Bi-LSTM]
    end

    subgraph Backbone
        MIX[Action Recognition Architecture]
    end

    subgraph Fusion
        GATE[Gating]
    end

    subgraph Output
        H[Action Embedding]
    end

    IN --> CONV --> ATT --> LSTM
    LSTM --> GATE
    MIX --> GATE
    GATE --> H

    style CONV fill:#6366f1,stroke:#6366f1,color:#fff
    style ATT fill:#8b5cf6,stroke:#8b5cf6,color:#fff
    style LSTM fill:#06b6d4,stroke:#06b6d4,color:#fff
    style MIX fill:#ec4899,stroke:#ec4899,color:#fff
    style GATE fill:#10b981,stroke:#10b981,color:#fff
          

Key Components

  • Velocity/Acceleration: Removes static pose information, focuses on dynamics
  • Multi-scale Conv: Captures short (k=3), medium (k=5), and long (k=7) temporal patterns
  • Temporal Attention: Models long-range dependencies between frames
  • Bidirectional LSTM: Sequential modeling with forward/backward context

Architecture Equations

Velocity: $\mathbf{V}_t = \mathbf{s}_t - \mathbf{s}_{t-1}, \quad t \in [2, T]$
Acceleration: $\mathbf{A}_t = \mathbf{V}_t - \mathbf{V}_{t-1}, \quad t \in [3, T]$
Input: $\mathbf{X} = [\mathbf{s}; \mathbf{V}; \mathbf{A}] \in \mathbb{R}^{B \times 9 \times T \times V}$
Conv: $\mathbf{H}^{(k)} = \text{Conv1D}_k(\mathbf{X}), \quad k \in \{3, 5, 7\}$

src/model/action_encoder.py → ActionEncoder

Identity Path
Identity Encoder
Extracts static skeletal structure from target skeleton
flowchart LR
    subgraph Input
        IN[Static Pose + Bones]
    end

    subgraph Processing
        GCN[Spatial GCN]
        SA[Spatial Attention]
        POOL[Global Avg Pool]
    end

    subgraph Output
        H[Identity Embedding]
    end

    IN --> GCN --> SA --> POOL --> H

    style GCN fill:#10b981,stroke:#10b981,color:#fff
    style SA fill:#06b6d4,stroke:#06b6d4,color:#fff
    style POOL fill:#8b5cf6,stroke:#8b5cf6,color:#fff
          

Key Components

  • Static Pose: Temporal mean removes action-specific dynamics
  • Bone Lengths: 24 bone vectors computed from joint pairs
  • Spatial GCN: 3-layer graph convolution on skeleton topology
  • Global Pooling: Aggregates joint features into single identity vector

Architecture Equations

$$\bar{\mathbf{x}} = \frac{1}{T}\sum_{t=1}^{T}\mathbf{x}_t$$
$$\mathbf{b}_e = \bar{\mathbf{x}}_{j_1(e)} - \bar{\mathbf{x}}_{j_2(e)}, \quad e \in \mathcal{E}$$
$$\mathbf{H}_{\text{identity}} = \text{Pool}(\text{Attn}(\text{GCN}^{(3)}(\mathbf{B})))$$
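A sketch of the static-pose and bone-vector computation that feeds the GCN; the edge list and tensor layout are illustrative assumptions:

import torch

def bone_vectors(x: torch.Tensor, edges: list[tuple[int, int]]) -> torch.Tensor:
    """x: (B, 3, T, V) target sequence; edges: 24 (j1, j2) joint pairs.

    Temporal mean removes action dynamics; per-edge differences give
    the bone vectors b_e defined above.
    """
    x_bar = x.mean(dim=2)                          # (B, 3, V) static pose
    bones = torch.stack(
        [x_bar[..., j1] - x_bar[..., j2] for j1, j2 in edges], dim=-1
    )                                              # (B, 3, |E|)
    return bones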

src/model/identity_encoder.py → IdentityEncoder

Fusion
Factorized Decoder
Fuses action and identity representations via cross-attention
flowchart LR
    subgraph Inputs
        FN[Frame n]
        AE[Action Encoder]
        IE[Identity Encoder]
    end

    subgraph DecoderLayer[Decoder Layer]
        SA[Self-Attention]
        EDA[Encoder-Decoder Attn]
        ST[Style Transfer]
        FFN[FFN]
    end

    subgraph Output
        FN1[Frame n+1]
    end

    FN --> SA --> EDA
    AE --> EDA
    EDA --> ST
    IE --> ST
    ST --> FFN --> FN1
    FN1 -->|Loop| FN

    style SA fill:#f59e0b,stroke:#f59e0b,color:#000
    style EDA fill:#6366f1,stroke:#6366f1,color:#fff
    style ST fill:#10b981,stroke:#10b981,color:#fff
    style FFN fill:#8b5cf6,stroke:#8b5cf6,color:#fff
          

Key Components

  • Causal Self-Attention: Autoregressive generation with masked attention
  • Separate Cross-Attention: Independent attention to action and identity
  • Adaptive Fusion: Learned gating between action and identity contributions
  • Bone Correction: Post-hoc adjustment to match target bone lengths

Architecture Equations

$$\mathbf{Q} = \mathbf{H}\mathbf{W}_Q, \quad \mathbf{K}_a = \mathbf{H}_{\text{action}}\mathbf{W}_K^a$$
$$\mathbf{F} = \alpha \cdot \text{XAttn}(\mathbf{Q}, \mathbf{H}_{\text{action}}) + (1-\alpha) \cdot \text{XAttn}(\mathbf{Q}, \mathbf{H}_{\text{identity}})$$
$$\hat{\mathbf{x}} = \text{BoneCorrect}(\text{FFN}(\mathbf{F}), \mathbf{B}^{\text{tgt}})$$
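A hedged sketch of the adaptive fusion step using PyTorch's built-in multi-head attention; module and argument names are assumptions, and both memories are assumed already projected to d_model:

import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Gated dual cross-attention matching the fusion equation above."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.xattn_action = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.xattn_identity = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())

    def forward(self, h, h_action, h_identity):
        # h: (B, T, d) decoder states; memories: (B, S, d)
        f_a, _ = self.xattn_action(h, h_action, h_action)
        f_i, _ = self.xattn_identity(h, h_identity, h_identity)
        alpha = self.gate(h)                       # learned gate alpha in [0, 1]
        return alpha * f_a + (1 - alpha) * f_i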

src/model/factorized_decoder.py → FactorizedDecoder

Model Hyperparameters
Default configuration from main_config.yaml
| Component | Parameter | Value | Description |
|---|---|---|---|
| Action Encoder | d_action | 256 | Action embedding dimension |
| Action Encoder | n_heads | 8 | Attention heads |
| Action Encoder | lstm_hidden | 256 | LSTM hidden size |
| Action Encoder | conv_kernels | [3, 5, 7] | Multi-scale kernel sizes |
| Identity Encoder | d_identity | 128 | Identity embedding dimension |
| Identity Encoder | gcn_layers | 3 | GCN depth |
| Decoder | d_model | 256 | Decoder hidden dimension |
| Decoder | n_layers | 6 | Transformer layers |
| Decoder | n_heads | 8 | Attention heads |
| Training | batch_size | 64 | Samples per batch |
| Training | learning_rate | 1e-4 | Adam optimizer LR |

Three-Stage Training Strategy

A curriculum-based approach that first establishes disentanglement, then learns reconstruction, and finally fine-tunes the complete system end-to-end.
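A sketch of the per-stage freeze/unfreeze schedule; the attribute names mirror the module layout described above but are assumptions:

def configure_stage(model, stage: int):
    """Set which parameters train in each curriculum stage."""
    for p in model.parameters():
        p.requires_grad = True
    if stage == 1:                                 # encoders + disentanglement losses
        for p in model.decoder.parameters():
            p.requires_grad = False
    elif stage == 2:                               # decoder + reconstruction losses
        for enc in (model.action_encoder, model.identity_encoder):
            for p in enc.parameters():
                p.requires_grad = False
    # stage 3: all parameters trainable; lower LR plus gradient clipping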

Training Timeline
Progressive training with loss scheduling
flowchart LR
    subgraph Stage1[Stage 1: Encoder Pretraining]
        S1A[Train Encoders]
        S1B[Disentanglement Losses]
        S1C[20k iterations]
    end

    subgraph Stage2[Stage 2: Decoder Training]
        S2A[Freeze Encoders]
        S2B[Train Decoder]
        S2C[15k iterations]
    end

    subgraph Stage3[Stage 3: End-to-End]
        S3A[Unfreeze All]
        S3B[Joint Optimization]
        S3C[15k iterations]
    end

    Stage1 --> Stage2 --> Stage3

    style S1A fill:#6366f1,stroke:#6366f1,color:#fff
    style S1B fill:#6366f1,stroke:#6366f1,color:#fff
    style S1C fill:#6366f1,stroke:#6366f1,color:#fff
    style S2A fill:#10b981,stroke:#10b981,color:#fff
    style S2B fill:#10b981,stroke:#10b981,color:#fff
    style S2C fill:#10b981,stroke:#10b981,color:#fff
    style S3A fill:#f59e0b,stroke:#f59e0b,color:#000
    style S3B fill:#f59e0b,stroke:#f59e0b,color:#000
    style S3C fill:#f59e0b,stroke:#f59e0b,color:#000
          
Stage 1

Encoder Pretraining

Train encoders with disentanglement losses. Decoder frozen or absent.

Active Losses:
  • Action Recognition (AR)
  • Re-Identification (RI)
  • Contrastive (InfoNCE)
  • Adversarial (GRL)
  • Orthogonality
  • Mutual Information
Stage 2

Decoder Training

Train decoder with frozen encoders. Focus on reconstruction quality.

Active Losses:
  • MSE Reconstruction
  • Bone Length Consistency
  • Temporal Smoothness
  • Velocity Consistency
  • Joint Angle Limits
  • Foot Contact
Stage 3

End-to-End Fine-tuning

Unfreeze all parameters. Joint optimization with all losses.

Active Losses:
  • All Stage 1 losses
  • All Stage 2 losses
  • Lower learning rate
  • Gradient clipping
Stage 1
Encoder Pretraining Details
Establishing disentanglement before reconstruction
flowchart TB
    subgraph Inputs
        SRC[Source Motion]
        TGT[Target Skeleton]
    end

    subgraph Encoders
        AE[Action Encoder]
        IE[Identity Encoder]
    end

    subgraph ActionLosses
        AR[AR Loss]
        RI[RI Loss]
        ADV[Adversarial]
    end

    subgraph DisentLosses
        NCE[Contrastive]
        ORTH[Orthogonality]
        MI[Mutual Info]
    end

    SRC --> AE
    TGT --> IE
    AE --> AR
    AE --> RI
    AE --> ADV
    AE --> NCE
    AE --> ORTH
    AE --> MI
    IE --> NCE
    IE --> ORTH
    IE --> MI

    style AE fill:#6366f1,stroke:#6366f1,color:#fff
    style IE fill:#10b981,stroke:#10b981,color:#fff
    style AR fill:#22c55e,stroke:#22c55e,color:#fff
    style RI fill:#ef4444,stroke:#ef4444,color:#fff
    style ADV fill:#f59e0b,stroke:#f59e0b,color:#000
          
$$\mathcal{L}_{\text{Stage1}} = \lambda_{\text{AR}}\mathcal{L}_{\text{AR}} + \lambda_{\text{RI}}\mathcal{L}_{\text{RI}} + \lambda_{\text{NCE}}\mathcal{L}_{\text{NCE}} + \lambda_{\text{adv}}\mathcal{L}_{\text{adv}} + \lambda_{\text{orth}}\mathcal{L}_{\text{orth}} + \lambda_{\text{MI}}\mathcal{L}_{\text{MI}}$$
Key Insight: By pretraining encoders with disentanglement losses before introducing reconstruction, we ensure the latent spaces are well-separated before the decoder can exploit shortcuts.
Stage 2
Decoder Training Details
Learning reconstruction with frozen encoders
flowchart TB
    subgraph Frozen
        AE[Action Enc - Frozen]
        IE[Identity Enc - Frozen]
    end

    subgraph Decoder
        DEC[Factorized Decoder]
    end

    subgraph ReconLosses
        MSE[MSE Loss]
        EE[End-Effector]
    end

    subgraph PhysicalLosses
        BONE[Bone Length]
        SMOOTH[Smoothness]
        VEL[Velocity]
        JOINT[Joint Limits]
        FOOT[Foot Contact]
    end

    AE --> DEC
    IE --> DEC
    DEC --> MSE
    DEC --> EE
    DEC --> BONE
    DEC --> SMOOTH
    DEC --> VEL
    DEC --> JOINT
    DEC --> FOOT

    style AE fill:#6366f1,stroke:#6366f1,color:#fff
    style IE fill:#10b981,stroke:#10b981,color:#fff
    style DEC fill:#f59e0b,stroke:#f59e0b,color:#000
          
$$\mathcal{L}_{\text{Stage2}} = \lambda_{\text{MSE}}\mathcal{L}_{\text{MSE}} + \lambda_{\text{bone}}\mathcal{L}_{\text{bone}} + \lambda_{\text{smooth}}\mathcal{L}_{\text{smooth}} + \lambda_{\text{vel}}\mathcal{L}_{\text{vel}} + \lambda_{\text{joint}}\mathcal{L}_{\text{joint}} + \lambda_{\text{foot}}\mathcal{L}_{\text{foot}}$$
Loss Weights by Stage
Optimized weights from hyperparameter tuning
| Loss | Stage 1 | Stage 2 | Stage 3 | Purpose |
|---|---|---|---|---|
| $\lambda_{\text{AR}}$ | 1.0 | 0.0 | 0.5 | Action recognition auxiliary |
| $\lambda_{\text{RI}}$ | 1.0 | 0.0 | 0.5 | Re-ID minimization |
| $\lambda_{\text{NCE}}$ | 0.1 | 0.0 | 0.05 | Contrastive disentanglement |
| $\lambda_{\text{adv}}$ | 0.5 | 0.0 | 0.25 | Adversarial identity confusion |
| $\lambda_{\text{orth}}$ | 0.1 | 0.0 | 0.05 | Orthogonality constraint |
| $\lambda_{\text{MI}}$ | 0.1 | 0.0 | 0.05 | Mutual information minimization |
| $\lambda_{\text{MSE}}$ | 0.0 | 1.0 | 1.0 | Reconstruction fidelity |
| $\lambda_{\text{bone}}$ | 0.0 | 0.5 | 0.5 | Bone length consistency |
| $\lambda_{\text{smooth}}$ | 0.0 | 0.1 | 0.1 | Temporal smoothness |
| $\lambda_{\text{vel}}$ | 0.0 | 0.1 | 0.1 | Velocity distribution matching |

configs/main_config.yaml

Loss Functions

A comprehensive set of losses organized into three categories: disentanglement, reconstruction, and physical plausibility constraints.

Stage 1
Disentanglement Losses
Ensuring action and identity representations are separated

Action Recognition (AR)

Ensures action embedding preserves action class information via cross-entropy classification.

$$\mathcal{L}_{\text{AR}} = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)$$

where $\hat{y} = \text{softmax}(\text{MLP}(\mathbf{H}_{\text{action}}))$


Re-Identification (RI)

Minimizes identity information in action embedding by maximizing classification entropy.

$$\mathcal{L}_{\text{RI}} = -\mathcal{H}(\hat{p}_{\text{id}}) = \sum_{i=1}^{N} \hat{p}_i \log(\hat{p}_i)$$

Pushes identity predictions toward uniform distribution
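A sketch of the entropy-maximization objective, assuming an identity probe that outputs logits over subjects:

import torch

def ri_loss(id_logits: torch.Tensor) -> torch.Tensor:
    """Negative entropy of identity predictions (lower = more uniform).

    id_logits: (B, N_subjects) from an identity probe on H_action.
    """
    p = id_logits.softmax(dim=-1)
    return (p * p.clamp_min(1e-8).log()).sum(dim=-1).mean()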


⊥ Contrastive (InfoNCE)

Pulls same-action embeddings together, pushes different-action embeddings apart.

$$\mathcal{L}_{\text{NCE}} = -\log\frac{\exp(\text{sim}(\mathbf{h}_i, \mathbf{h}_j^+)/\tau)}{\sum_{k}\exp(\text{sim}(\mathbf{h}_i, \mathbf{h}_k)/\tau)}$$

Temperature $\tau = 0.07$, cosine similarity
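A sketch of a supervised InfoNCE over a batch, where positives are embeddings sharing an action label; the batching scheme is an assumption:

import torch
import torch.nn.functional as F

def info_nce(h: torch.Tensor, labels: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """h: (B, d) action embeddings; labels: (B,) action classes."""
    h = F.normalize(h, dim=-1)
    sim = h @ h.T / tau                            # cosine similarity / temperature
    mask_self = torch.eye(len(h), dtype=torch.bool, device=h.device)
    pos = (labels[:, None] == labels[None, :]) & ~mask_self
    denom = torch.logsumexp(sim.masked_fill(mask_self, float("-inf")), dim=1, keepdim=True)
    log_prob = sim - denom                         # log-softmax over non-self pairs
    return -(log_prob * pos).sum(1).div(pos.sum(1).clamp_min(1)).mean()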

⚔️ Adversarial (GRL)

Gradient Reversal Layer confuses identity discriminator during backprop.

$$\mathcal{L}_{\text{adv}} = -\mathcal{L}_{\text{disc}}(\text{GRL}(\mathbf{H}_{\text{action}}))$$

GRL reverses gradients: $\frac{\partial}{\partial \theta} = -\lambda \frac{\partial \mathcal{L}}{\partial \theta}$
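A standard Gradient Reversal Layer implementation matching the equation above:

import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; scales gradients by -lambda on backward."""

    @staticmethod
    def forward(ctx, x, lam: float):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

# Usage: id_logits = discriminator(GradReverse.apply(h_action, 1.0))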

⟂ Orthogonality

Enforces orthogonality between action and identity embedding spaces.

$$\mathcal{L}_{\text{orth}} = \left|\frac{\mathbf{H}_{\text{action}} \cdot \mathbf{H}_{\text{identity}}}{\|\mathbf{H}_{\text{action}}\| \|\mathbf{H}_{\text{identity}}\|}\right|$$

Minimizes absolute cosine similarity toward 0
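A sketch, assuming both embeddings are projected to a shared dimension (d_action=256 vs d_identity=128 would otherwise mismatch):

import torch.nn.functional as F

def orth_loss(h_action, h_identity):
    """Absolute cosine similarity between paired embeddings, driven toward 0."""
    return F.cosine_similarity(h_action, h_identity, dim=-1).abs().mean()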

Mutual Information

Minimizes statistical dependence via cross-correlation matrix.

$$\mathcal{L}_{\text{MI}} = \sum_{i \neq j} C_{ij}^2, \quad C = \frac{\mathbf{H}_a^T \mathbf{H}_i}{\sqrt{\text{Var}(\mathbf{H}_a)\text{Var}(\mathbf{H}_i)}}$$

Off-diagonal elements of cross-correlation should be zero
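A Barlow-Twins-style sketch of the cross-correlation penalty over a batch:

import torch

def mi_loss(h_a: torch.Tensor, h_i: torch.Tensor) -> torch.Tensor:
    """h_a: (B, d_a) action batch; h_i: (B, d_i) identity batch.

    Standardize per dimension, form the cross-correlation matrix, and
    penalize the squared off-diagonal entries.
    """
    h_a = (h_a - h_a.mean(0)) / h_a.std(0).clamp_min(1e-8)
    h_i = (h_i - h_i.mean(0)) / h_i.std(0).clamp_min(1e-8)
    c = h_a.T @ h_i / h_a.shape[0]                 # (d_a, d_i)
    return c.pow(2).sum() - c.diagonal().pow(2).sum()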

src/training/disentanglement_losses.py

Stage 2
Reconstruction Losses
Ensuring output motion matches target quality

MSE Reconstruction

Mean squared error between predicted and ground truth joint positions.

$$\mathcal{L}_{\text{MSE}} = \frac{1}{TVD}\sum_{t,v,d}(\hat{x}_{t,v,d} - x_{t,v,d})^2$$

End-Effector

Higher weight on hands, feet, and head for perceptually important joints.

$$\mathcal{L}_{\text{EE}} = \sum_{v \in \mathcal{V}_{\text{ee}}} w_v \|\hat{\mathbf{x}}_v - \mathbf{x}_v\|^2$$

$\mathcal{V}_{\text{ee}} = \{\text{hands, feet, head}\}$

Temporal Smoothing

Penalizes jitter by minimizing second-order temporal differences.

$$\mathcal{L}_{\text{smooth}} = \frac{1}{T-2}\sum_{t=2}^{T-1}\|\hat{\mathbf{x}}_{t+1} - 2\hat{\mathbf{x}}_t + \hat{\mathbf{x}}_{t-1}\|^2$$
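A direct sketch of the second-difference penalty:

def smoothness_loss(x_hat):
    """x_hat: (B, T, V, 3); penalize discrete second-order temporal differences."""
    accel = x_hat[:, 2:] - 2 * x_hat[:, 1:-1] + x_hat[:, :-2]
    return accel.pow(2).mean()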

FID-Velocity

Fréchet distance between velocity distributions of generated and real motions.

$$\text{FID}_v = \|\mu_{\hat{v}} - \mu_v\|^2 + \text{Tr}(\Sigma_{\hat{v}} + \Sigma_v - 2(\Sigma_{\hat{v}}\Sigma_v)^{1/2})$$

src/training/loss.py

Stage 2-3
Physical Plausibility Losses
Ensuring biomechanically valid motion

Bone Length

Enforces constant bone lengths across all frames.

$$\mathcal{L}_{\text{bone}} = \sum_{e \in \mathcal{E}}\sum_t (l_e^{(t)} - \bar{l}_e)^2$$
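A sketch of the bone-length consistency penalty; the edge list is dataset-specific:

import torch

def bone_length_loss(x_hat: torch.Tensor, edges: list[tuple[int, int]]) -> torch.Tensor:
    """x_hat: (B, T, V, 3); penalize per-edge length variation across frames."""
    lengths = torch.stack(
        [(x_hat[..., j1, :] - x_hat[..., j2, :]).norm(dim=-1) for j1, j2 in edges],
        dim=-1,
    )                                              # (B, T, |E|)
    mean_len = lengths.mean(dim=1, keepdim=True)   # per-sequence mean length
    return (lengths - mean_len).pow(2).sum(dim=-1).mean()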

Joint Limits

Penalizes anatomically impossible joint angles.

$$\mathcal{L}_{\text{joint}} = \sum_j \left[\max(0, \theta_j - \theta_j^{\max})^2 + \max(0, \theta_j^{\min} - \theta_j)^2\right]$$

Foot Contact

Reduces foot sliding during ground contact phases.

$$\mathcal{L}_{\text{foot}} = \sum_t c_t \|\mathbf{v}_{\text{foot}}^{(t)}\|^2$$
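A sketch, assuming binary contact labels are available per frame:

def foot_contact_loss(x_hat, contact, foot_idx):
    """x_hat: (B, T, V, 3); contact: (B, T) in {0, 1}; foot_idx: foot joint ids.

    Penalizes foot velocity on frames flagged as ground contact.
    """
    v_foot = x_hat[:, 1:, foot_idx] - x_hat[:, :-1, foot_idx]  # (B, T-1, |F|, 3)
    return (contact[:, 1:, None, None] * v_foot.pow(2)).mean()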

Momentum

Conservation of angular momentum for realistic dynamics.

$$\mathcal{L}_{\text{mom}} = \|\Delta \mathbf{L}\|^2$$

Smoothness

Minimizes acceleration magnitude for natural motion.

$$\mathcal{L}_{\text{TS}} = \frac{1}{T-2}\sum_t \|\mathbf{a}_t\|^2$$

Velocity MMD

Maximum Mean Discrepancy between velocity distributions.

$$\text{MMD}^2 = \mathbb{E}[k(\mathbf{v}, \mathbf{v}')] - 2\mathbb{E}[k(\mathbf{v}, \hat{\mathbf{v}})] + \mathbb{E}[k(\hat{\mathbf{v}}, \hat{\mathbf{v}}')]$$
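A sketch of the biased RBF-kernel estimator of this quantity; the bandwidth σ is a free choice:

import torch

def mmd_rbf(v: torch.Tensor, v_hat: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """v, v_hat: (N, d) real and generated velocity samples."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(v, v).mean() - 2 * k(v, v_hat).mean() + k(v_hat, v_hat).mean()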

src/losses/physical_plausibility.py

Evaluation Metrics

Comprehensive evaluation across three dimensions: privacy protection, action utility preservation, and physical plausibility of generated motions.

Privacy Metrics
Measuring identity information leakage

Re-Identification Accuracy (RI)

Accuracy of a classifier trained to identify source subject from action embedding. Lower is better (target: random chance = 1/N).

$$\text{RI} = \frac{1}{M}\sum_{i=1}^{M} \mathbb{1}[\hat{s}_i = s_i]$$
Target: RI ≈ 1/N_subjects (e.g., 2.5% for 40 subjects)
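In practice RI is measured with a post-hoc probe trained on frozen action embeddings; a sketch, where the linear probe and training budget are choices, not specified above:

import torch
import torch.nn as nn

def ri_accuracy(h_train, s_train, h_test, s_test, n_subjects, epochs: int = 100):
    """Fit a linear identity probe on frozen H_action, report test accuracy."""
    probe = nn.Linear(h_train.shape[1], n_subjects)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        nn.functional.cross_entropy(probe(h_train), s_train).backward()
        opt.step()
    with torch.no_grad():
        return (probe(h_test).argmax(-1) == s_test).float().mean().item()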

⚔️ Identity Discriminator Accuracy

Accuracy of adversarial discriminator trying to predict identity. Lower is better (target: 50% = random).

$$\text{Disc}_{\text{acc}} = \frac{1}{M}\sum_{i=1}^{M} \mathbb{1}[D(\mathbf{H}_{\text{action}}^{(i)}) = s_i]$$
Target: Discriminator accuracy ≈ 50%
Utility Metrics
Measuring action information preservation

Action Recognition Accuracy (AR)

Accuracy of action classifier on action embeddings. Higher is better.

$$\text{AR} = \frac{1}{M}\sum_{i=1}^{M} \mathbb{1}[\hat{a}_i = a_i]$$
Target: AR > 85% (comparable to supervised baselines)

Reconstruction MSE

Mean squared error between retargeted and ground truth motion. Lower is better.

$$\text{MSE} = \frac{1}{TVD}\sum_{t,v,d}(\hat{x}_{t,v,d} - x_{t,v,d})^2$$
Target: MSE < 0.01 (normalized coordinates)
Physical Plausibility Metrics
Measuring biomechanical validity of generated motion

Bone Length Consistency (BLC)

Variance of bone lengths across frames. Should be near zero.

$$\text{BLC} = \frac{1}{|\mathcal{E}|}\sum_{e}\text{Var}_t(l_e^{(t)})$$

Joint Angle Limits (JAL)

Percentage of frames with anatomically valid joint angles.

$$\text{JAL} = \frac{1}{TJ}\sum_{t,j}\mathbb{1}[\theta_j^{(t)} \in [\theta_j^{\min}, \theta_j^{\max}]]$$

Temporal Smoothness (TS)

Average acceleration magnitude (lower = smoother).

$$\text{TS} = \frac{1}{T-2}\sum_{t=2}^{T-1}\|\mathbf{a}_t\|$$

Velocity Consistency (VC/MMD)

MMD between generated and real velocity distributions.

$$\text{VC} = \text{MMD}(\mathcal{V}_{\text{gen}}, \mathcal{V}_{\text{real}})$$

Foot Contact Consistency (FCC)

Correlation between detected contact and low foot velocity.

$$\text{FCC} = \text{Corr}(c_t, \mathbb{1}[\|\mathbf{v}_{\text{foot}}^{(t)}\| < \epsilon])$$

🦶 Foot Sliding Error (FSE)

Average foot velocity during detected contact frames.

$$\text{FSE} = \frac{\sum_t c_t \|\mathbf{v}_{\text{foot}}^{(t)}\|}{\sum_t c_t}$$
Metrics Summary
Target values and interpretation guide
| Category | Metric | Direction | Target | Interpretation |
|---|---|---|---|---|
| Privacy | RI Accuracy | ↓ Lower | ≈ 1/N | Random chance = perfect privacy |
| Privacy | Disc. Accuracy | ↓ Lower | ≈ 50% | Discriminator confused |
| Utility | AR Accuracy | ↑ Higher | > 85% | Action preserved |
| Utility | MSE | ↓ Lower | < 0.01 | High reconstruction quality |
| Physical | BLC | ↓ Lower | ≈ 0 | Constant bone lengths |
| Physical | JAL | ↑ Higher | > 95% | Valid joint angles |
| Physical | TS | ↓ Lower | < 0.1 | Smooth motion |
| Physical | FSE | ↓ Lower | < 0.01 | No foot sliding |

Ablation Studies

Two categories of systematic evaluation: Encoder-Side Ablations test what happens when we remove components from the baseline, while Transformer Tricks Ablations test entirely new alternative approaches.

Ablation Study Design
Two categories: Component removal vs. Alternative approaches
flowchart TB
    subgraph Baseline
        B_IN[Input] --> B_TC[Conv]
        B_TC --> B_LSTM[LSTM]
        B_LSTM --> B_ATT[Attention]
        B_ATT --> B_OUT[Output]
    end

    subgraph EncoderAblations
        E1[No Conv]
        E2[No LSTM]
        E3[Full-Seq Identity]
    end

    subgraph TransformerTricks
        T1[Position Tokens]
        T2[Dynamics Tokens]
        T3[Tokens + Codebook]
    end

    Baseline --> EncoderAblations
    Baseline --> TransformerTricks

    style B_TC fill:#6366f1,stroke:#6366f1,color:#fff
    style B_LSTM fill:#06b6d4,stroke:#06b6d4,color:#fff
    style B_ATT fill:#8b5cf6,stroke:#8b5cf6,color:#fff
          
Encoder-Side Ablations
Test the importance of each baseline component by removing it. The baseline uses Conv + LSTM + Attention (no tokenization).
Transformer Tricks Ablations
Test alternative approaches NOT in baseline: tokenization strategies and VQ codebook discretization. May improve privacy.
Reference
Full Baseline Model
Conv + LSTM + Attention architecture (NO tokenization, NO codebook)
flowchart LR
    subgraph ActionEncoder
        IN[Source Motion] --> TC[Conv]
        TC --> LSTM[BiLSTM]
        LSTM --> ATT[Self-Attn]
        ATT --> HA[Action Emb]
    end

    subgraph IdentityEncoder
        TGT[Target Skeleton] --> MEAN[Mean]
        MEAN --> GCN[GCN]
        GCN --> HI[Identity Emb]
    end

    subgraph Decoder
        HA --> XATTN[Cross-Attn]
        HI --> XATTN
        XATTN --> OUT[Output]
    end

    style TC fill:#6366f1,stroke:#6366f1,color:#fff
    style LSTM fill:#06b6d4,stroke:#06b6d4,color:#fff
    style ATT fill:#8b5cf6,stroke:#8b5cf6,color:#fff
    style GCN fill:#10b981,stroke:#10b981,color:#fff
    style XATTN fill:#f59e0b,stroke:#f59e0b,color:#000
          
Baseline Configuration: Multi-scale temporal convolutions (k=3,5,7) → Bidirectional LSTM → Self-Attention → Action embedding. Identity encoder uses temporal mean → Spatial GCN → Identity embedding. No tokenization or codebook.

Encoder-Side Ablations (Remove components from baseline)


No Temporal Convolutions

Removes multi-scale temporal convolutions. Input goes directly to LSTM.

flowchart LR
    IN[Input] --> X[SKIP]
    X --> LSTM[LSTM]
    LSTM --> ATT[Attention]
    ATT --> OUT[Output]

    style X fill:#ef4444,stroke:#ef4444,color:#fff
    style LSTM fill:#06b6d4,stroke:#06b6d4,color:#fff
    style ATT fill:#8b5cf6,stroke:#8b5cf6,color:#fff
            
Tests: Importance of local temporal pattern capture

No LSTM

Removes bidirectional LSTM. Conv output goes directly to attention.

flowchart LR
    IN[Input] --> TC[Conv]
    TC --> X[SKIP]
    X --> ATT[Attention]
    ATT --> OUT[Output]

    style TC fill:#6366f1,stroke:#6366f1,color:#fff
    style X fill:#ef4444,stroke:#ef4444,color:#fff
    style ATT fill:#8b5cf6,stroke:#8b5cf6,color:#fff
            
Tests: Importance of sequential/recurrent modeling

Identity Full-Sequence

Identity encoder uses full sequence instead of temporal mean (static pose).

flowchart LR
    IN[Full Sequence] --> X[No Mean]
    X --> GCN[GCN]
    GCN --> OUT[Output]

    style X fill:#ef4444,stroke:#ef4444,color:#fff
    style GCN fill:#10b981,stroke:#10b981,color:#fff
            
Tests: Risk of action leakage into identity representation

Transformer Tricks Ablations (New alternative approaches - NOT in baseline)


Token Position

Flattens raw positions per frame into tokens. Bypasses vel/acc computation and conv layers.

flowchart LR
    IN[Positions] --> TOK[Position Tokenizer]
    TOK --> ATT[Self-Attn]
    ATT --> LSTM[LSTM]
    LSTM --> GATE[Gate]
    GATE --> OUT[Output]

    style TOK fill:#ec4899,stroke:#ec4899,color:#fff
    style ATT fill:#8b5cf6,stroke:#8b5cf6,color:#fff
    style LSTM fill:#06b6d4,stroke:#06b6d4,color:#fff
    style GATE fill:#10b981,stroke:#10b981,color:#fff
            
Tests: Simple tokenization vs multi-scale conv preprocessing

Token Dynamics

Tokenizes pos + vel + acc + bone lengths per frame. Bypasses conv layers.

flowchart LR
    IN[Positions] --> TOK[Dynamics Tokenizer]
    TOK --> ATT[Self-Attn]
    ATT --> LSTM[LSTM]
    LSTM --> GATE[Gate]
    GATE --> OUT[Output]

    style TOK fill:#ec4899,stroke:#ec4899,color:#fff
    style ATT fill:#8b5cf6,stroke:#8b5cf6,color:#fff
    style LSTM fill:#06b6d4,stroke:#06b6d4,color:#fff
    style GATE fill:#10b981,stroke:#10b981,color:#fff
            
Tests: Internal dynamics computation vs external preprocessing

Token Dynamics + Codebook

Adds VQ-VAE style codebook to discretize dynamics tokens. May improve privacy via quantization.

flowchart LR
    IN[Positions] --> TOK[Dynamics Tokenizer]
    TOK --> VQ[VQ Codebook]
    VQ --> ATT[Self-Attn]
    ATT --> LSTM[LSTM]
    LSTM --> GATE[Gate]
    GATE --> OUT[Output]

    style TOK fill:#ec4899,stroke:#ec4899,color:#fff
    style VQ fill:#f59e0b,stroke:#f59e0b,color:#000
    style ATT fill:#8b5cf6,stroke:#8b5cf6,color:#fff
    style LSTM fill:#06b6d4,stroke:#06b6d4,color:#fff
    style GATE fill:#10b981,stroke:#10b981,color:#fff
            
Tests: Discretization for stronger disentanglement
Expected Results
Hypothesized impact of each ablation by category
| Category | Ablation | AR (↑) | RI (↓) | MSE (↓) | Physical |
|---|---|---|---|---|---|
| Baseline | Full Model | Reference | Reference | Reference | Reference |
| Encoder-Side | No Temporal Conv | ↓ Worse | → Similar | ↓ Worse | ↓ Worse |
| Encoder-Side | No LSTM | ↓ Worse | → Similar | ↓ Worse | ↓ Slightly |
| Encoder-Side | Identity Full-Seq | → Similar | ↓ Worse (leak) | → Similar | → Similar |
| Transformer Tricks | Token Position | ? Unknown | → Similar | ? Unknown | → Similar |
| Transformer Tricks | Token Dynamics | → Similar? | ↑ Better? | ? Unknown | → Similar |
| Transformer Tricks | Token + Codebook | → Similar? | ↑ Better? | ? Unknown | → Similar |
Note: Encoder-side ablations test component removal from baseline. Transformer tricks are NEW approaches that may improve privacy (RI) at acceptable utility cost. Results will be updated as studies complete.