Research Paper

DisentangledTMR

Privacy-Preserving Skeleton Motion Retargeting via Explicit Architectural Disentanglement

3 Training Stages · 6+ Disentanglement Losses · 6 Physical Metrics · 8 Ablation Studies

Project Overview

DisentangledTMR is a privacy-preserving skeleton motion retargeting system that uses explicit architectural disentanglement to separate action dynamics from skeletal identity.

The Privacy Problem
Why skeleton data needs protection

The Threat

Skeleton motion data, despite lacking visual appearance, contains rich biometric signatures that enable person re-identification with >80% accuracy.

  • 📏 Static cues: Bone lengths, limb ratios, body proportions
  • 🏃 Dynamic cues: Gait patterns, movement style, posture habits
  • 🔗 Linkage attacks: Cross-session tracking without labels

Our Solution

Transfer motion dynamics to a target skeleton while provably removing identity information through explicit architectural disentanglement.

Key Insight: By architecturally separating action and identity encoders, we ensure the action representation cannot leak identity information by construction.
Disentanglement Criteria
Four properties that define successful disentanglement
Action Retention

$\mathbf{H}_{\text{action}}$ must preserve action class information

Identity Removal

$\mathbf{H}_{\text{action}}$ must not leak source identity

Identity Capture

$\mathbf{H}_{\text{identity}}$ must encode target skeleton structure

Statistical Independence

$\mathbf{H}_{\text{action}} \perp \mathbf{H}_{\text{identity}}$

System Pipeline
End-to-end motion retargeting flow
flowchart LR
    subgraph Input
        S1[Source Motion<br>Person A]
        S2[Target Skeleton<br>Person B]
    end

    subgraph Encoders
        AE[Action Encoder<br>Temporal Conv + LSTM + Attn]
        IE[Identity Encoder<br>Spatial GCN + Attn]
    end

    subgraph Decoder
        FD[Factorized Decoder<br>Cross-Attention Fusion]
    end

    subgraph Output
        OUT[Retargeted Motion<br>Action A on Body B]
    end

    S1 --> AE
    S2 --> IE
    AE --> FD
    IE --> FD
    FD --> OUT

    style AE fill:#6366f1,stroke:#6366f1,color:#fff
    style IE fill:#10b981,stroke:#10b981,color:#fff
    style FD fill:#f59e0b,stroke:#f59e0b,color:#000

src/model/disentangled_tmr.py → DisentangledTMR

Datasets
Evaluation benchmarks for skeleton-based action recognition
| Dataset | Subjects | Actions | Samples | Joints | Frames | Protocol |
|---|---|---|---|---|---|---|
| NTU RGB+D 60 | 40 | 60 | 56,880 | 25 | 64 | Cross-Subject / Cross-View |
| NTU RGB+D 120 | 106 | 120 | 114,480 | 25 | 64 | Cross-Subject / Cross-Setup |
| ETRI-Activity3D | 100 | 55 | 112,620 | 25 | 64 | Cross-Subject |
All datasets use the same 25-joint skeleton topology (Kinect v2 format) with sequences normalized to T=64 frames via linear interpolation.
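As a concrete illustration, a minimal sketch of the T=64 linear-interpolation resampling; the function name and the (T, V, 3) tensor layout are assumptions, not the project's API:

import torch
import torch.nn.functional as F

def resample_sequence(x: torch.Tensor, t_out: int = 64) -> torch.Tensor:
    """Linearly resample a skeleton sequence to a fixed length.

    x: (T, V, 3) joint positions (V=25 for Kinect v2).
    Returns: (t_out, V, 3).
    """
    t, v, d = x.shape
    flat = x.reshape(t, v * d).T.unsqueeze(0)      # (1, V*3, T) for interpolate
    out = F.interpolate(flat, size=t_out, mode="linear", align_corners=True)
    return out.squeeze(0).T.reshape(t_out, v, d)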

Model Architecture

Three specialized components work together: Action Encoder extracts temporal dynamics, Identity Encoder captures skeletal structure, and Factorized Decoder fuses them for retargeting.

Action Path
Action Encoder
Extracts identity-invariant temporal dynamics from source motion
flowchart LR
    subgraph Input
        IN[Velocity + Acceleration]
    end

    subgraph CoreLayers
        CONV[Temporal Convs]
        ATT[Multi-Head Attention]
        LSTM[Bi-LSTM]
    end

    subgraph Backbone
        MIX[Action Recognition Architecture]
    end

    subgraph Fusion
        GATE[Gating]
    end

    subgraph Output
        H[Action Embedding]
    end

    IN --> CONV --> ATT --> LSTM
    LSTM --> GATE
    MIX --> GATE
    GATE --> H

    style CONV fill:#6366f1,stroke:#6366f1,color:#fff
    style ATT fill:#8b5cf6,stroke:#8b5cf6,color:#fff
    style LSTM fill:#06b6d4,stroke:#06b6d4,color:#fff
    style MIX fill:#ec4899,stroke:#ec4899,color:#fff
    style GATE fill:#10b981,stroke:#10b981,color:#fff
          

Key Components

  • Velocity/Acceleration: Removes static pose information, focuses on dynamics
  • Multi-scale Conv: Captures short (k=3), medium (k=5), and long (k=7) temporal patterns
  • Temporal Attention: Models long-range dependencies between frames
  • Bidirectional LSTM: Sequential modeling with forward/backward context

Architecture Equations

Velocity: $\mathbf{V}_t = \mathbf{s}_t - \mathbf{s}_{t-1}, \quad t \in [2, T]$
Acceleration: $\mathbf{A}_t = \mathbf{V}_t - \mathbf{V}_{t-1}, \quad t \in [3, T]$
Input: $\mathbf{X} = [\mathbf{s}; \mathbf{V}; \mathbf{A}] \in \mathbb{R}^{B \times 9 \times T \times V}$
Conv: $\mathbf{H}^{(k)} = \text{Conv1D}_k(\mathbf{X}), \quad k \in \{3, 5, 7\}$

src/model/action_encoder.py → ActionEncoder

Identity Path
Identity Encoder
Extracts static skeletal structure from target skeleton
flowchart LR
    subgraph Input
        IN[Static Pose + Bones]
    end

    subgraph Processing
        GCN[Spatial GCN]
        SA[Spatial Attention]
        POOL[Global Avg Pool]
    end

    subgraph Output
        H[Identity Embedding]
    end

    IN --> GCN --> SA --> POOL --> H

    style GCN fill:#10b981,stroke:#10b981,color:#fff
    style SA fill:#06b6d4,stroke:#06b6d4,color:#fff
    style POOL fill:#8b5cf6,stroke:#8b5cf6,color:#fff
          

Key Components

  • Static Pose: Temporal mean removes action-specific dynamics
  • Bone Lengths: 24 bone vectors computed from joint pairs
  • Spatial GCN: 3-layer graph convolution on skeleton topology
  • Global Pooling: Aggregates joint features into single identity vector

Architecture Equations

$$\bar{\mathbf{x}} = \frac{1}{T}\sum_{t=1}^{T}\mathbf{x}_t$$
$$\mathbf{b}_e = \bar{\mathbf{x}}_{j_1(e)} - \bar{\mathbf{x}}_{j_2(e)}, \quad e \in \mathcal{E}$$
$$\mathbf{H}_{\text{identity}} = \text{Pool}(\text{Attn}(\text{GCN}^{(3)}(\mathbf{B})))$$
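A sketch of the static-pose and bone-vector computation that feeds the GCN; the edge list and tensor layout are illustrative assumptions:

import torch

def bone_vectors(x: torch.Tensor, edges: list[tuple[int, int]]) -> torch.Tensor:
    """x: (B, 3, T, V) target sequence; edges: 24 (j1, j2) joint pairs.

    Temporal mean removes action dynamics; per-edge differences give
    the bone vectors b_e defined above.
    """
    x_bar = x.mean(dim=2)                          # (B, 3, V) static pose
    bones = torch.stack(
        [x_bar[..., j1] - x_bar[..., j2] for j1, j2 in edges], dim=-1
    )                                              # (B, 3, |E|)
    return bones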

src/model/identity_encoder.py → IdentityEncoder

Fusion
Factorized Decoder
Fuses action and identity representations via cross-attention
flowchart LR
    subgraph Inputs
        FN[Frame n]
        AE[Action Encoder]
        IE[Identity Encoder]
    end

    subgraph DecoderLayer[Decoder Layer]
        SA[Self-Attention]
        EDA[Encoder-Decoder Attn]
        ST[Style Transfer]
        FFN[FFN]
    end

    subgraph Output
        FN1[Frame n+1]
    end

    FN --> SA --> EDA
    AE --> EDA
    EDA --> ST
    IE --> ST
    ST --> FFN --> FN1
    FN1 -->|Loop| FN

    style SA fill:#f59e0b,stroke:#f59e0b,color:#000
    style EDA fill:#6366f1,stroke:#6366f1,color:#fff
    style ST fill:#10b981,stroke:#10b981,color:#fff
    style FFN fill:#8b5cf6,stroke:#8b5cf6,color:#fff
          

Key Components

  • Causal Self-Attention: Autoregressive generation with masked attention
  • Separate Cross-Attention: Independent attention to action and identity
  • Adaptive Fusion: Learned gating between action and identity contributions
  • Bone Correction: Post-hoc adjustment to match target bone lengths

Architecture Equations

$$\mathbf{Q} = \mathbf{H}\mathbf{W}_Q, \quad \mathbf{K}_a = \mathbf{H}_{\text{action}}\mathbf{W}_K^a$$
$$\mathbf{F} = \alpha \cdot \text{XAttn}(\mathbf{Q}, \mathbf{H}_{\text{action}}) + (1-\alpha) \cdot \text{XAttn}(\mathbf{Q}, \mathbf{H}_{\text{identity}})$$
$$\hat{\mathbf{x}} = \text{BoneCorrect}(\text{FFN}(\mathbf{F}), \mathbf{B}^{\text{tgt}})$$
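A hedged sketch of the adaptive fusion step using PyTorch's built-in multi-head attention; module and argument names are assumptions, and both memories are assumed already projected to d_model:

import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Gated dual cross-attention matching the fusion equation above."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.xattn_action = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.xattn_identity = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())

    def forward(self, h, h_action, h_identity):
        # h: (B, T, d) decoder states; memories: (B, S, d)
        f_a, _ = self.xattn_action(h, h_action, h_action)
        f_i, _ = self.xattn_identity(h, h_identity, h_identity)
        alpha = self.gate(h)                       # learned gate alpha in [0, 1]
        return alpha * f_a + (1 - alpha) * f_i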

src/model/factorized_decoder.py → FactorizedDecoder

Model Hyperparameters
Default configuration from main_config.yaml
| Component | Parameter | Value | Description |
|---|---|---|---|
| Action Encoder | d_action | 256 | Action embedding dimension |
| Action Encoder | n_heads | 8 | Attention heads |
| Action Encoder | lstm_hidden | 256 | LSTM hidden size |
| Action Encoder | conv_kernels | [3, 5, 7] | Multi-scale kernel sizes |
| Identity Encoder | d_identity | 128 | Identity embedding dimension |
| Identity Encoder | gcn_layers | 3 | GCN depth |
| Decoder | d_model | 256 | Decoder hidden dimension |
| Decoder | n_layers | 6 | Transformer layers |
| Decoder | n_heads | 8 | Attention heads |
| Training | batch_size | 64 | Samples per batch |
| Training | learning_rate | 1e-4 | Adam optimizer LR |

Three-Stage Training Strategy

A curriculum-based approach that first establishes disentanglement, then learns reconstruction, and finally fine-tunes the complete system end-to-end.
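A sketch of the per-stage freeze/unfreeze schedule; the attribute names mirror the module layout described above but are assumptions:

def configure_stage(model, stage: int):
    """Set which parameters train in each curriculum stage."""
    for p in model.parameters():
        p.requires_grad = True
    if stage == 1:                                 # encoders + disentanglement losses
        for p in model.decoder.parameters():
            p.requires_grad = False
    elif stage == 2:                               # decoder + reconstruction losses
        for enc in (model.action_encoder, model.identity_encoder):
            for p in enc.parameters():
                p.requires_grad = False
    # stage 3: all parameters trainable; lower LR plus gradient clipping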

Training Timeline
Progressive training with loss scheduling
flowchart LR
    subgraph Stage1[Stage 1: Encoder Pretraining]
        S1A[Train Encoders]
        S1B[Disentanglement Losses]
        S1C[20k iterations]
    end

    subgraph Stage2[Stage 2: Decoder Training]
        S2A[Freeze Encoders]
        S2B[Train Decoder]
        S2C[15k iterations]
    end

    subgraph Stage3[Stage 3: End-to-End]
        S3A[Unfreeze All]
        S3B[Joint Optimization]
        S3C[15k iterations]
    end

    Stage1 --> Stage2 --> Stage3

    style S1A fill:#6366f1,stroke:#6366f1,color:#fff
    style S1B fill:#6366f1,stroke:#6366f1,color:#fff
    style S1C fill:#6366f1,stroke:#6366f1,color:#fff
    style S2A fill:#10b981,stroke:#10b981,color:#fff
    style S2B fill:#10b981,stroke:#10b981,color:#fff
    style S2C fill:#10b981,stroke:#10b981,color:#fff
    style S3A fill:#f59e0b,stroke:#f59e0b,color:#000
    style S3B fill:#f59e0b,stroke:#f59e0b,color:#000
    style S3C fill:#f59e0b,stroke:#f59e0b,color:#000
          
Stage 1

Encoder Pretraining

Train encoders with disentanglement losses. Decoder frozen or absent.

Active Losses:
  • Action Recognition (AR)
  • Re-Identification (RI)
  • Contrastive (InfoNCE)
  • Adversarial (GRL)
  • Orthogonality
  • Mutual Information
Stage 2

Decoder Training

Train decoder with frozen encoders. Focus on reconstruction quality.

Active Losses:
  • MSE Reconstruction
  • Bone Length Consistency
  • Temporal Smoothness
  • Velocity Consistency
  • Joint Angle Limits
  • Foot Contact
Stage 3

End-to-End Fine-tuning

Unfreeze all parameters. Joint optimization with all losses.

Active Losses:
  • All Stage 1 losses
  • All Stage 2 losses
  • Lower learning rate
  • Gradient clipping
Stage 1
Encoder Pretraining Details
Establishing disentanglement before reconstruction
flowchart TB
    subgraph Inputs
        SRC[Source Motion]
        TGT[Target Skeleton]
    end

    subgraph Encoders
        AE[Action Encoder]
        IE[Identity Encoder]
    end

    subgraph ActionLosses
        AR[AR Loss]
        RI[RI Loss]
        ADV[Adversarial]
    end

    subgraph DisentLosses
        NCE[Contrastive]
        ORTH[Orthogonality]
        MI[Mutual Info]
    end

    SRC --> AE
    TGT --> IE
    AE --> AR
    AE --> RI
    AE --> ADV
    AE --> NCE
    AE --> ORTH
    AE --> MI
    IE --> NCE
    IE --> ORTH
    IE --> MI

    style AE fill:#6366f1,stroke:#6366f1,color:#fff
    style IE fill:#10b981,stroke:#10b981,color:#fff
    style AR fill:#22c55e,stroke:#22c55e,color:#fff
    style RI fill:#ef4444,stroke:#ef4444,color:#fff
    style ADV fill:#f59e0b,stroke:#f59e0b,color:#000
          
$$\mathcal{L}_{\text{Stage1}} = \lambda_{\text{AR}}\mathcal{L}_{\text{AR}} + \lambda_{\text{RI}}\mathcal{L}_{\text{RI}} + \lambda_{\text{NCE}}\mathcal{L}_{\text{NCE}} + \lambda_{\text{adv}}\mathcal{L}_{\text{adv}} + \lambda_{\text{orth}}\mathcal{L}_{\text{orth}} + \lambda_{\text{MI}}\mathcal{L}_{\text{MI}}$$
Key Insight: By pretraining encoders with disentanglement losses before introducing reconstruction, we ensure the latent spaces are well-separated before the decoder can exploit shortcuts.
Stage 2
Decoder Training Details
Learning reconstruction with frozen encoders
flowchart TB
    subgraph Frozen
        AE[Action Enc - Frozen]
        IE[Identity Enc - Frozen]
    end

    subgraph Decoder
        DEC[Factorized Decoder]
    end

    subgraph ReconLosses
        MSE[MSE Loss]
        EE[End-Effector]
    end

    subgraph PhysicalLosses
        BONE[Bone Length]
        SMOOTH[Smoothness]
        VEL[Velocity]
        JOINT[Joint Limits]
        FOOT[Foot Contact]
    end

    AE --> DEC
    IE --> DEC
    DEC --> MSE
    DEC --> EE
    DEC --> BONE
    DEC --> SMOOTH
    DEC --> VEL
    DEC --> JOINT
    DEC --> FOOT

    style AE fill:#6366f1,stroke:#6366f1,color:#fff
    style IE fill:#10b981,stroke:#10b981,color:#fff
    style DEC fill:#f59e0b,stroke:#f59e0b,color:#000
          
$$\mathcal{L}_{\text{Stage2}} = \lambda_{\text{MSE}}\mathcal{L}_{\text{MSE}} + \lambda_{\text{bone}}\mathcal{L}_{\text{bone}} + \lambda_{\text{smooth}}\mathcal{L}_{\text{smooth}} + \lambda_{\text{vel}}\mathcal{L}_{\text{vel}} + \lambda_{\text{joint}}\mathcal{L}_{\text{joint}} + \lambda_{\text{foot}}\mathcal{L}_{\text{foot}}$$
Loss Weights by Stage
Optimized weights from hyperparameter tuning
| Loss | Stage 1 | Stage 2 | Stage 3 | Purpose |
|---|---|---|---|---|
| $\lambda_{\text{AR}}$ | 1.0 | 0.0 | 0.5 | Action recognition auxiliary |
| $\lambda_{\text{RI}}$ | 1.0 | 0.0 | 0.5 | Re-ID minimization |
| $\lambda_{\text{NCE}}$ | 0.1 | 0.0 | 0.05 | Contrastive disentanglement |
| $\lambda_{\text{adv}}$ | 0.5 | 0.0 | 0.25 | Adversarial identity confusion |
| $\lambda_{\text{orth}}$ | 0.1 | 0.0 | 0.05 | Orthogonality constraint |
| $\lambda_{\text{MI}}$ | 0.1 | 0.0 | 0.05 | Mutual information minimization |
| $\lambda_{\text{MSE}}$ | 0.0 | 1.0 | 1.0 | Reconstruction fidelity |
| $\lambda_{\text{bone}}$ | 0.0 | 0.5 | 0.5 | Bone length consistency |
| $\lambda_{\text{smooth}}$ | 0.0 | 0.1 | 0.1 | Temporal smoothness |
| $\lambda_{\text{vel}}$ | 0.0 | 0.1 | 0.1 | Velocity distribution matching |

configs/main_config.yaml

Loss Functions

A comprehensive set of losses organized into three categories: disentanglement, reconstruction, and physical plausibility constraints.

Stage 1
Disentanglement Losses
Ensuring action and identity representations are separated

Action Recognition (AR)

Ensures action embedding preserves action class information via cross-entropy classification.

$$\mathcal{L}_{\text{AR}} = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)$$

where $\hat{y} = \text{softmax}(\text{MLP}(\mathbf{H}_{\text{action}}))$


Re-Identification (RI)

Minimizes identity information in action embedding by maximizing classification entropy.

$$\mathcal{L}_{\text{RI}} = -\mathcal{H}(\hat{p}_{\text{id}}) = \sum_{i=1}^{N} \hat{p}_i \log(\hat{p}_i)$$

Pushes identity predictions toward uniform distribution
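A sketch of the entropy-maximization objective, assuming an identity probe that outputs logits over subjects:

import torch

def ri_loss(id_logits: torch.Tensor) -> torch.Tensor:
    """Negative entropy of identity predictions (lower = more uniform).

    id_logits: (B, N_subjects) from an identity probe on H_action.
    """
    p = id_logits.softmax(dim=-1)
    return (p * p.clamp_min(1e-8).log()).sum(dim=-1).mean()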


⊥ Contrastive (InfoNCE)

Pulls same-action embeddings together, pushes different-action embeddings apart.

$$\mathcal{L}_{\text{NCE}} = -\log\frac{\exp(\text{sim}(\mathbf{h}_i, \mathbf{h}_j^+)/\tau)}{\sum_{k}\exp(\text{sim}(\mathbf{h}_i, \mathbf{h}_k)/\tau)}$$

Temperature $\tau = 0.07$, cosine similarity
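A sketch of a supervised InfoNCE over a batch, where positives are embeddings sharing an action label; the batching scheme is an assumption:

import torch
import torch.nn.functional as F

def info_nce(h: torch.Tensor, labels: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """h: (B, d) action embeddings; labels: (B,) action classes."""
    h = F.normalize(h, dim=-1)
    sim = h @ h.T / tau                            # cosine similarity / temperature
    mask_self = torch.eye(len(h), dtype=torch.bool, device=h.device)
    pos = (labels[:, None] == labels[None, :]) & ~mask_self
    denom = torch.logsumexp(sim.masked_fill(mask_self, float("-inf")), dim=1, keepdim=True)
    log_prob = sim - denom                         # log-softmax over non-self pairs
    return -(log_prob * pos).sum(1).div(pos.sum(1).clamp_min(1)).mean()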

⚔️ Adversarial (GRL)

Gradient Reversal Layer confuses identity discriminator during backprop.

$$\mathcal{L}_{\text{adv}} = -\mathcal{L}_{\text{disc}}(\text{GRL}(\mathbf{H}_{\text{action}}))$$

GRL reverses gradients: $\frac{\partial}{\partial \theta} = -\lambda \frac{\partial \mathcal{L}}{\partial \theta}$
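A standard Gradient Reversal Layer implementation matching the equation above:

import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; scales gradients by -lambda on backward."""

    @staticmethod
    def forward(ctx, x, lam: float):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

# Usage: id_logits = discriminator(GradReverse.apply(h_action, 1.0))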

⟂ Orthogonality

Enforces orthogonality between action and identity embedding spaces.

$$\mathcal{L}_{\text{orth}} = \left|\frac{\mathbf{H}_{\text{action}} \cdot \mathbf{H}_{\text{identity}}}{\|\mathbf{H}_{\text{action}}\| \|\mathbf{H}_{\text{identity}}\|}\right|$$

Minimizes absolute cosine similarity toward 0
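A sketch, assuming both embeddings are projected to a shared dimension (d_action=256 vs d_identity=128 would otherwise mismatch):

import torch.nn.functional as F

def orth_loss(h_action, h_identity):
    """Absolute cosine similarity between paired embeddings, driven toward 0."""
    return F.cosine_similarity(h_action, h_identity, dim=-1).abs().mean()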

Mutual Information

Minimizes statistical dependence via cross-correlation matrix.

$$\mathcal{L}_{\text{MI}} = \sum_{i \neq j} C_{ij}^2, \quad C = \frac{\mathbf{H}_a^T \mathbf{H}_i}{\sqrt{\text{Var}(\mathbf{H}_a)\text{Var}(\mathbf{H}_i)}}$$

Off-diagonal elements of cross-correlation should be zero
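A Barlow-Twins-style sketch of the cross-correlation penalty over a batch:

import torch

def mi_loss(h_a: torch.Tensor, h_i: torch.Tensor) -> torch.Tensor:
    """h_a: (B, d_a) action batch; h_i: (B, d_i) identity batch.

    Standardize per dimension, form the cross-correlation matrix, and
    penalize the squared off-diagonal entries.
    """
    h_a = (h_a - h_a.mean(0)) / h_a.std(0).clamp_min(1e-8)
    h_i = (h_i - h_i.mean(0)) / h_i.std(0).clamp_min(1e-8)
    c = h_a.T @ h_i / h_a.shape[0]                 # (d_a, d_i)
    return c.pow(2).sum() - c.diagonal().pow(2).sum()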

src/training/disentanglement_losses.py

Stage 2
Reconstruction Losses
Ensuring output motion matches target quality

MSE Reconstruction

Mean squared error between predicted and ground truth joint positions.

$$\mathcal{L}_{\text{MSE}} = \frac{1}{TVD}\sum_{t,v,d}(\hat{x}_{t,v,d} - x_{t,v,d})^2$$

End-Effector

Higher weight on hands, feet, and head for perceptually important joints.

$$\mathcal{L}_{\text{EE}} = \sum_{v \in \mathcal{V}_{\text{ee}}} w_v \|\hat{\mathbf{x}}_v - \mathbf{x}_v\|^2$$

$\mathcal{V}_{\text{ee}} = \{\text{hands, feet, head}\}$

Temporal Smoothing

Penalizes jitter by minimizing second-order temporal differences.

$$\mathcal{L}_{\text{smooth}} = \frac{1}{T-2}\sum_{t=2}^{T-1}\|\hat{\mathbf{x}}_{t+1} - 2\hat{\mathbf{x}}_t + \hat{\mathbf{x}}_{t-1}\|^2$$
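A direct sketch of the second-difference penalty:

def smoothness_loss(x_hat):
    """x_hat: (B, T, V, 3); penalize discrete second-order temporal differences."""
    accel = x_hat[:, 2:] - 2 * x_hat[:, 1:-1] + x_hat[:, :-2]
    return accel.pow(2).mean()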

FID-Velocity

Fréchet distance between velocity distributions of generated and real motions.

$$\text{FID}_v = \|\mu_{\hat{v}} - \mu_v\|^2 + \text{Tr}(\Sigma_{\hat{v}} + \Sigma_v - 2(\Sigma_{\hat{v}}\Sigma_v)^{1/2})$$

src/training/loss.py

Stage 2-3
Physical Plausibility Losses
Ensuring biomechanically valid motion

Bone Length

Enforces constant bone lengths across all frames.

$$\mathcal{L}_{\text{bone}} = \sum_{e \in \mathcal{E}}\sum_t (l_e^{(t)} - \bar{l}_e)^2$$
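A sketch of the bone-length consistency penalty; the edge list is dataset-specific:

import torch

def bone_length_loss(x_hat: torch.Tensor, edges: list[tuple[int, int]]) -> torch.Tensor:
    """x_hat: (B, T, V, 3); penalize per-edge length variation across frames."""
    lengths = torch.stack(
        [(x_hat[..., j1, :] - x_hat[..., j2, :]).norm(dim=-1) for j1, j2 in edges],
        dim=-1,
    )                                              # (B, T, |E|)
    mean_len = lengths.mean(dim=1, keepdim=True)   # per-sequence mean length
    return (lengths - mean_len).pow(2).sum(dim=-1).mean()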

Joint Limits

Penalizes anatomically impossible joint angles.

$$\mathcal{L}_{\text{joint}} = \sum_j \left[\max(0, \theta_j - \theta_j^{\max})^2 + \max(0, \theta_j^{\min} - \theta_j)^2\right]$$

Foot Contact

Reduces foot sliding during ground contact phases.

$$\mathcal{L}_{\text{foot}} = \sum_t c_t \|\mathbf{v}_{\text{foot}}^{(t)}\|^2$$
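A sketch, assuming binary contact labels are available per frame:

def foot_contact_loss(x_hat, contact, foot_idx):
    """x_hat: (B, T, V, 3); contact: (B, T) in {0, 1}; foot_idx: foot joint ids.

    Penalizes foot velocity on frames flagged as ground contact.
    """
    v_foot = x_hat[:, 1:, foot_idx] - x_hat[:, :-1, foot_idx]  # (B, T-1, |F|, 3)
    return (contact[:, 1:, None, None] * v_foot.pow(2)).mean()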

Momentum

Conservation of angular momentum for realistic dynamics.

$$\mathcal{L}_{\text{mom}} = \|\Delta \mathbf{L}\|^2$$

Smoothness

Minimizes acceleration magnitude for natural motion.

$$\mathcal{L}_{\text{TS}} = \frac{1}{T-2}\sum_t \|\mathbf{a}_t\|^2$$

Velocity MMD

Maximum Mean Discrepancy between velocity distributions.

$$\text{MMD}^2 = \mathbb{E}[k(\mathbf{v}, \mathbf{v}')] - 2\mathbb{E}[k(\mathbf{v}, \hat{\mathbf{v}})] + \mathbb{E}[k(\hat{\mathbf{v}}, \hat{\mathbf{v}}')]$$
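A sketch of the biased RBF-kernel estimator of this quantity; the bandwidth σ is a free choice:

import torch

def mmd_rbf(v: torch.Tensor, v_hat: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """v, v_hat: (N, d) real and generated velocity samples."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(v, v).mean() - 2 * k(v, v_hat).mean() + k(v_hat, v_hat).mean()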

src/losses/physical_plausibility.py

Evaluation Metrics

Comprehensive evaluation across three dimensions: privacy protection, action utility preservation, and physical plausibility of generated motions.

Privacy Metrics
Measuring identity information leakage

Re-Identification Accuracy (RI)

Accuracy of a classifier trained to identify source subject from action embedding. Lower is better (target: random chance = 1/N).

$$\text{RI} = \frac{1}{M}\sum_{i=1}^{M} \mathbb{1}[\hat{s}_i = s_i]$$
Target: RI ≈ 1/N_subjects (e.g., 2.5% for 40 subjects)
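In practice RI is measured with a post-hoc probe trained on frozen action embeddings; a sketch, where the linear probe and training budget are choices, not specified above:

import torch
import torch.nn as nn

def ri_accuracy(h_train, s_train, h_test, s_test, n_subjects, epochs: int = 100):
    """Fit a linear identity probe on frozen H_action, report test accuracy."""
    probe = nn.Linear(h_train.shape[1], n_subjects)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        nn.functional.cross_entropy(probe(h_train), s_train).backward()
        opt.step()
    with torch.no_grad():
        return (probe(h_test).argmax(-1) == s_test).float().mean().item()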

⚔️ Identity Discriminator Accuracy

Accuracy of adversarial discriminator trying to predict identity. Lower is better (target: 50% = random).

$$\text{Disc}_{\text{acc}} = \frac{1}{M}\sum_{i=1}^{M} \mathbb{1}[D(\mathbf{H}_{\text{action}}^{(i)}) = s_i]$$
Target: Discriminator accuracy ≈ 50%
Utility Metrics
Measuring action information preservation

Action Recognition Accuracy (AR)

Accuracy of action classifier on action embeddings. Higher is better.

$$\text{AR} = \frac{1}{M}\sum_{i=1}^{M} \mathbb{1}[\hat{a}_i = a_i]$$
Target: AR > 85% (comparable to supervised baselines)

Reconstruction MSE

Mean squared error between retargeted and ground truth motion. Lower is better.

$$\text{MSE} = \frac{1}{TVD}\sum_{t,v,d}(\hat{x}_{t,v,d} - x_{t,v,d})^2$$
Target: MSE < 0.01 (normalized coordinates)
Physical Plausibility Metrics
Measuring biomechanical validity of generated motion

Bone Length Consistency (BLC)

Variance of bone lengths across frames. Should be near zero.

$$\text{BLC} = \frac{1}{|\mathcal{E}|}\sum_{e}\text{Var}_t(l_e^{(t)})$$

Joint Angle Limits (JAL)

Percentage of frames with anatomically valid joint angles.

$$\text{JAL} = \frac{1}{TJ}\sum_{t,j}\mathbb{1}[\theta_j^{(t)} \in [\theta_j^{\min}, \theta_j^{\max}]]$$

Temporal Smoothness (TS)

Average acceleration magnitude (lower = smoother).

$$\text{TS} = \frac{1}{T-2}\sum_{t=2}^{T-1}\|\mathbf{a}_t\|$$

Velocity Consistency (VC/MMD)

MMD between generated and real velocity distributions.

$$\text{VC} = \text{MMD}(\mathcal{V}_{\text{gen}}, \mathcal{V}_{\text{real}})$$

Foot Contact Consistency (FCC)

Correlation between detected contact and low foot velocity.

$$\text{FCC} = \text{Corr}(c_t, \mathbb{1}[\|\mathbf{v}_{\text{foot}}^{(t)}\| < \epsilon])$$

🦶 Foot Sliding Error (FSE)

Average foot velocity during detected contact frames.

$$\text{FSE} = \frac{\sum_t c_t \|\mathbf{v}_{\text{foot}}^{(t)}\|}{\sum_t c_t}$$
Metrics Summary
Target values and interpretation guide
| Category | Metric | Direction | Target | Interpretation |
|---|---|---|---|---|
| Privacy | RI Accuracy | ↓ Lower | ≈ 1/N | Random chance = perfect privacy |
| Privacy | Disc. Accuracy | ↓ Lower | ≈ 50% | Discriminator confused |
| Utility | AR Accuracy | ↑ Higher | > 85% | Action preserved |
| Utility | MSE | ↓ Lower | < 0.01 | High reconstruction quality |
| Physical | BLC | ↓ Lower | ≈ 0 | Constant bone lengths |
| Physical | JAL | ↑ Higher | > 95% | Valid joint angles |
| Physical | TS | ↓ Lower | < 0.1 | Smooth motion |
| Physical | FSE | ↓ Lower | < 0.01 | No foot sliding |

Ablation Studies

Two categories of systematic evaluation: Encoder-Side Ablations test what happens when we remove components from the baseline, while Transformer Tricks Ablations test entirely new alternative approaches.

Ablation Study Design
Two categories: Component removal vs. Alternative approaches
flowchart TB
    subgraph Baseline
        B_IN[Input] --> B_TC[Conv]
        B_TC --> B_LSTM[LSTM]
        B_LSTM --> B_ATT[Attention]
        B_ATT --> B_OUT[Output]
    end

    subgraph EncoderAblations
        E1[No Conv]
        E2[No LSTM]
        E3[Full-Seq Identity]
    end

    subgraph TransformerTricks
        T1[Position Tokens]
        T2[Dynamics Tokens]
        T3[Tokens + Codebook]
    end

    Baseline --> EncoderAblations
    Baseline --> TransformerTricks

    style B_TC fill:#6366f1,stroke:#6366f1,color:#fff
    style B_LSTM fill:#06b6d4,stroke:#06b6d4,color:#fff
    style B_ATT fill:#8b5cf6,stroke:#8b5cf6,color:#fff
          
Encoder-Side Ablations
Test the importance of each baseline component by removing it. The baseline uses Conv + LSTM + Attention (no tokenization).
Transformer Tricks Ablations
Test alternative approaches NOT in baseline: tokenization strategies and VQ codebook discretization. May improve privacy.
Reference
Full Baseline Model
Conv + LSTM + Attention architecture (NO tokenization, NO codebook)
flowchart LR
    subgraph ActionEncoder
        IN[Source Motion] --> TC[Conv]
        TC --> LSTM[BiLSTM]
        LSTM --> ATT[Self-Attn]
        ATT --> HA[Action Emb]
    end

    subgraph IdentityEncoder
        TGT[Target Skeleton] --> MEAN[Mean]
        MEAN --> GCN[GCN]
        GCN --> HI[Identity Emb]
    end

    subgraph Decoder
        HA --> XATTN[Cross-Attn]
        HI --> XATTN
        XATTN --> OUT[Output]
    end

    style TC fill:#6366f1,stroke:#6366f1,color:#fff
    style LSTM fill:#06b6d4,stroke:#06b6d4,color:#fff
    style ATT fill:#8b5cf6,stroke:#8b5cf6,color:#fff
    style GCN fill:#10b981,stroke:#10b981,color:#fff
    style XATTN fill:#f59e0b,stroke:#f59e0b,color:#000
          
Baseline Configuration: Multi-scale temporal convolutions (k=3,5,7) → Bidirectional LSTM → Self-Attention → Action embedding. Identity encoder uses temporal mean → Spatial GCN → Identity embedding. No tokenization or codebook.

Encoder-Side Ablations (Remove components from baseline)


No Temporal Convolutions

Removes multi-scale temporal convolutions. Input goes directly to LSTM.

flowchart LR
    IN[Input] --> X[SKIP]
    X --> LSTM[LSTM]
    LSTM --> ATT[Attention]
    ATT --> OUT[Output]

    style X fill:#ef4444,stroke:#ef4444,color:#fff
    style LSTM fill:#06b6d4,stroke:#06b6d4,color:#fff
    style ATT fill:#8b5cf6,stroke:#8b5cf6,color:#fff
            
Tests: Importance of local temporal pattern capture

No LSTM

Removes bidirectional LSTM. Conv output goes directly to attention.

flowchart LR
    IN[Input] --> TC[Conv]
    TC --> X[SKIP]
    X --> ATT[Attention]
    ATT --> OUT[Output]

    style TC fill:#6366f1,stroke:#6366f1,color:#fff
    style X fill:#ef4444,stroke:#ef4444,color:#fff
    style ATT fill:#8b5cf6,stroke:#8b5cf6,color:#fff
            
Tests: Importance of sequential/recurrent modeling

Identity Full-Sequence

Identity encoder uses full sequence instead of temporal mean (static pose).

flowchart LR
    IN[Full Sequence] --> X[No Mean]
    X --> GCN[GCN]
    GCN --> OUT[Output]

    style X fill:#ef4444,stroke:#ef4444,color:#fff
    style GCN fill:#10b981,stroke:#10b981,color:#fff
            
Tests: Risk of action leakage into identity representation

Transformer Tricks Ablations (New alternative approaches - NOT in baseline)


Token Position

Flattens raw positions per frame into tokens. Bypasses vel/acc computation and conv layers.

flowchart LR
    IN[Positions] --> TOK[Position Tokenizer]
    TOK --> ATT[Self-Attn]
    ATT --> LSTM[LSTM]
    LSTM --> GATE[Gate]
    GATE --> OUT[Output]

    style TOK fill:#ec4899,stroke:#ec4899,color:#fff
    style ATT fill:#8b5cf6,stroke:#8b5cf6,color:#fff
    style LSTM fill:#06b6d4,stroke:#06b6d4,color:#fff
    style GATE fill:#10b981,stroke:#10b981,color:#fff
            
Tests: Simple tokenization vs multi-scale conv preprocessing

Token Dynamics

Tokenizes pos + vel + acc + bone lengths per frame. Bypasses conv layers.

flowchart LR
    IN[Positions] --> TOK[Dynamics Tokenizer]
    TOK --> ATT[Self-Attn]
    ATT --> LSTM[LSTM]
    LSTM --> GATE[Gate]
    GATE --> OUT[Output]

    style TOK fill:#ec4899,stroke:#ec4899,color:#fff
    style ATT fill:#8b5cf6,stroke:#8b5cf6,color:#fff
    style LSTM fill:#06b6d4,stroke:#06b6d4,color:#fff
    style GATE fill:#10b981,stroke:#10b981,color:#fff
            
Tests: Internal dynamics computation vs external preprocessing

Token Dynamics + Codebook

Adds VQ-VAE style codebook to discretize dynamics tokens. May improve privacy via quantization.

flowchart LR
    IN[Positions] --> TOK[Dynamics Tokenizer]
    TOK --> VQ[VQ Codebook]
    VQ --> ATT[Self-Attn]
    ATT --> LSTM[LSTM]
    LSTM --> GATE[Gate]
    GATE --> OUT[Output]

    style TOK fill:#ec4899,stroke:#ec4899,color:#fff
    style VQ fill:#f59e0b,stroke:#f59e0b,color:#000
    style ATT fill:#8b5cf6,stroke:#8b5cf6,color:#fff
    style LSTM fill:#06b6d4,stroke:#06b6d4,color:#fff
    style GATE fill:#10b981,stroke:#10b981,color:#fff
            
Tests: Discretization for stronger disentanglement
Expected Results
Hypothesized impact of each ablation by category
| Category | Ablation | AR (↑) | RI (↓) | MSE (↓) | Physical |
|---|---|---|---|---|---|
| Baseline | Full Model | Reference | Reference | Reference | Reference |
| Encoder-Side | No Temporal Conv | ↓ Worse | → Similar | ↓ Worse | ↓ Worse |
| Encoder-Side | No LSTM | ↓ Worse | → Similar | ↓ Worse | ↓ Slightly |
| Encoder-Side | Identity Full-Seq | → Similar | ↓ Worse (leak) | → Similar | → Similar |
| Transformer Tricks | Token Position | ? Unknown | → Similar | ? Unknown | → Similar |
| Transformer Tricks | Token Dynamics | → Similar? | ↑ Better? | ? Unknown | → Similar |
| Transformer Tricks | Token + Codebook | → Similar? | ↑ Better? | ? Unknown | → Similar |
Note: Encoder-side ablations test component removal from baseline. Transformer tricks are NEW approaches that may improve privacy (RI) at acceptable utility cost. Results will be updated as studies complete.