Privacy-Preserving Skeleton Motion Retargeting via Explicit Architectural Disentanglement
DisentangledTMR is a privacy-preserving skeleton motion retargeting system that uses explicit architectural disentanglement to separate action dynamics from skeletal identity.
Skeleton motion data, despite lacking visual appearance, contains rich biometric signatures that enable person re-identification with >80% accuracy.
The goal is to transfer motion dynamics to a target skeleton while suppressing source-identity information through explicit architectural disentanglement.
- $\mathbf{H}_{\text{action}}$ must preserve action class information
- $\mathbf{H}_{\text{action}}$ must not leak source identity
- $\mathbf{H}_{\text{identity}}$ must encode target skeleton structure
- $\mathbf{H}_{\text{action}} \perp \mathbf{H}_{\text{identity}}$: the two embeddings must remain orthogonal
flowchart LR
subgraph Input
S1[Source Motion
Person A]
S2[Target Skeleton
Person B]
end
subgraph Encoders
AE[Action Encoder
Temporal Conv + LSTM + Attn]
IE[Identity Encoder
Spatial GCN + Attn]
end
subgraph Decoder
FD[Factorized Decoder
Cross-Attention Fusion]
end
subgraph Output
OUT[Retargeted Motion
Action A on Body B]
end
S1 --> AE
S2 --> IE
AE --> FD
IE --> FD
FD --> OUT
style AE fill:#6366f1,stroke:#6366f1,color:#fff
style IE fill:#10b981,stroke:#10b981,color:#fff
style FD fill:#f59e0b,stroke:#f59e0b,color:#000
src/model/disentangled_tmr.py → DisentangledTMR
| Dataset | Subjects | Actions | Samples | Joints | Frames | Protocol |
|---|---|---|---|---|---|---|
| NTU RGB+D 60 | 40 | 60 | 56,880 | 25 | 64 | Cross-Subject / Cross-View |
| NTU RGB+D 120 | 106 | 120 | 114,480 | 25 | 64 | Cross-Subject / Cross-Setup |
| ETRI-Activity3D | 100 | 55 | 112,620 | 25 | 64 | Cross-Subject |
Three specialized components work together: Action Encoder extracts temporal dynamics, Identity Encoder captures skeletal structure, and Factorized Decoder fuses them for retargeting.
flowchart LR
subgraph Input
IN[Velocity + Acceleration]
end
subgraph CoreLayers
CONV[Temporal Convs]
ATT[Multi-Head Attention]
LSTM[Bi-LSTM]
end
subgraph Backbone
MIX[Action Recognition Architecture]
end
subgraph Fusion
GATE[Gating]
end
subgraph Output
H[Action Embedding]
end
IN --> CONV --> ATT --> LSTM
LSTM --> GATE
MIX --> GATE
GATE --> H
style CONV fill:#6366f1,stroke:#6366f1,color:#fff
style ATT fill:#8b5cf6,stroke:#8b5cf6,color:#fff
style LSTM fill:#06b6d4,stroke:#06b6d4,color:#fff
style MIX fill:#ec4899,stroke:#ec4899,color:#fff
style GATE fill:#10b981,stroke:#10b981,color:#fff
src/model/action_encoder.py → ActionEncoder
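The action encoder consumes velocity and acceleration streams (the diagram's input node) rather than raw positions. A minimal sketch of that preprocessing via finite differences; `motion_dynamics` is an illustrative helper, not the repo's actual API:

```python
import numpy as np

def motion_dynamics(positions: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Finite-difference velocity and acceleration.

    positions: (T, J, 3) joint positions over T frames.
    Returns velocity and acceleration, both (T, J, 3),
    padded at the start so the frame count is preserved.
    """
    velocity = np.diff(positions, axis=0, prepend=positions[:1])
    acceleration = np.diff(velocity, axis=0, prepend=velocity[:1])
    return velocity, acceleration

# Constant-velocity motion: velocity is constant after the padded first
# frame, and acceleration is zero except at the padding boundary.
pos = np.arange(5, dtype=float).reshape(5, 1, 1).repeat(3, axis=2)
vel, acc = motion_dynamics(pos)
```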
flowchart LR
subgraph Input
IN[Static Pose + Bones]
end
subgraph Processing
GCN[Spatial GCN]
SA[Spatial Attention]
POOL[Global Avg Pool]
end
subgraph Output
H[Identity Embedding]
end
IN --> GCN --> SA --> POOL --> H
style GCN fill:#10b981,stroke:#10b981,color:#fff
style SA fill:#06b6d4,stroke:#06b6d4,color:#fff
style POOL fill:#8b5cf6,stroke:#8b5cf6,color:#fff
src/model/identity_encoder.py → IdentityEncoder
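The identity encoder sees only a static pose (the temporal mean) plus bone lengths, never the full sequence. A hedged sketch of that input preparation, using an illustrative 3-joint chain (the real joint topology comes from the dataset, and the helper name is assumed):

```python
import numpy as np

# Illustrative parent list for a 3-joint chain; joint 0 is the root.
PARENTS = [-1, 0, 1]

def identity_inputs(motion: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Static pose (temporal mean) and per-bone lengths.

    motion: (T, J, 3). Returns pose (J, 3) and bone lengths (J-1,).
    """
    pose = motion.mean(axis=0)                      # average out dynamics
    bones = [np.linalg.norm(pose[j] - pose[p])      # bone = joint-to-parent
             for j, p in enumerate(PARENTS) if p >= 0]
    return pose, np.array(bones)

# A motionless chain with unit-length bones along the x-axis.
seq = np.tile(np.array([[0.0, 0, 0], [1, 0, 0], [2, 0, 0]]), (4, 1, 1))
pose, bones = identity_inputs(seq)
```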
flowchart LR
subgraph Inputs
FN[Frame n]
AE[Action Encoder]
IE[Identity Encoder]
end
subgraph DecoderLayer[Decoder Layer]
SA[Self-Attention]
EDA[Encoder-Decoder Attn]
ST[Style Transfer]
FFN[FFN]
end
subgraph Output
FN1[Frame n+1]
end
FN --> SA --> EDA
AE --> EDA
EDA --> ST
IE --> ST
ST --> FFN --> FN1
FN1 -->|Loop| FN
style SA fill:#f59e0b,stroke:#f59e0b,color:#000
style EDA fill:#6366f1,stroke:#6366f1,color:#fff
style ST fill:#10b981,stroke:#10b981,color:#fff
style FFN fill:#8b5cf6,stroke:#8b5cf6,color:#fff
src/model/factorized_decoder.py → FactorizedDecoder
| Component | Parameter | Value | Description |
|---|---|---|---|
| Action Encoder | d_action | 256 | Action embedding dimension |
| Action Encoder | n_heads | 8 | Attention heads |
| Action Encoder | lstm_hidden | 256 | LSTM hidden size |
| Action Encoder | conv_kernels | [3, 5, 7] | Multi-scale kernel sizes |
| Identity Encoder | d_identity | 128 | Identity embedding dimension |
| Identity Encoder | gcn_layers | 3 | GCN depth |
| Decoder | d_model | 256 | Decoder hidden dimension |
| Decoder | n_layers | 6 | Transformer layers |
| Decoder | n_heads | 8 | Attention heads |
| Training | batch_size | 64 | Samples per batch |
| Training | learning_rate | 1e-4 | Adam optimizer LR |
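The table above maps directly onto the training configuration. A hypothetical fragment illustrating how these values might be laid out (key names are assumptions; consult the actual `configs/main_config.yaml`):

```yaml
# Illustrative only -- key names may differ from configs/main_config.yaml.
model:
  action_encoder:
    d_action: 256
    n_heads: 8
    lstm_hidden: 256
    conv_kernels: [3, 5, 7]
  identity_encoder:
    d_identity: 128
    gcn_layers: 3
  decoder:
    d_model: 256
    n_layers: 6
    n_heads: 8
training:
  batch_size: 64
  learning_rate: 1.0e-4
```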
A curriculum-based approach that first establishes disentanglement, then learns reconstruction, and finally fine-tunes the complete system end-to-end.
flowchart LR
subgraph Stage1[Stage 1: Encoder Pretraining]
S1A[Train Encoders]
S1B[Disentanglement Losses]
S1C[20k iterations]
end
subgraph Stage2[Stage 2: Decoder Training]
S2A[Freeze Encoders]
S2B[Train Decoder]
S2C[15k iterations]
end
subgraph Stage3[Stage 3: End-to-End]
S3A[Unfreeze All]
S3B[Joint Optimization]
S3C[15k iterations]
end
Stage1 --> Stage2 --> Stage3
style S1A fill:#6366f1,stroke:#6366f1,color:#fff
style S1B fill:#6366f1,stroke:#6366f1,color:#fff
style S1C fill:#6366f1,stroke:#6366f1,color:#fff
style S2A fill:#10b981,stroke:#10b981,color:#fff
style S2B fill:#10b981,stroke:#10b981,color:#fff
style S2C fill:#10b981,stroke:#10b981,color:#fff
style S3A fill:#f59e0b,stroke:#f59e0b,color:#000
style S3B fill:#f59e0b,stroke:#f59e0b,color:#000
style S3C fill:#f59e0b,stroke:#f59e0b,color:#000
Stage 1: Train the encoders with the disentanglement losses; the decoder is frozen or absent.
Stage 2: Train the decoder with frozen encoders, focusing on reconstruction quality.
Stage 3: Unfreeze all parameters and jointly optimize with all losses active.
flowchart TB
subgraph Inputs
SRC[Source Motion]
TGT[Target Skeleton]
end
subgraph Encoders
AE[Action Encoder]
IE[Identity Encoder]
end
subgraph ActionLosses
AR[AR Loss]
RI[RI Loss]
ADV[Adversarial]
end
subgraph DisentLosses
NCE[Contrastive]
ORTH[Orthogonality]
MI[Mutual Info]
end
SRC --> AE
TGT --> IE
AE --> AR
AE --> RI
AE --> ADV
AE --> NCE
AE --> ORTH
AE --> MI
IE --> NCE
IE --> ORTH
IE --> MI
style AE fill:#6366f1,stroke:#6366f1,color:#fff
style IE fill:#10b981,stroke:#10b981,color:#fff
style AR fill:#22c55e,stroke:#22c55e,color:#fff
style RI fill:#ef4444,stroke:#ef4444,color:#fff
style ADV fill:#f59e0b,stroke:#f59e0b,color:#000
flowchart TB
subgraph Frozen
AE[Action Enc - Frozen]
IE[Identity Enc - Frozen]
end
subgraph Decoder
DEC[Factorized Decoder]
end
subgraph ReconLosses
MSE[MSE Loss]
EE[End-Effector]
end
subgraph PhysicalLosses
BONE[Bone Length]
SMOOTH[Smoothness]
VEL[Velocity]
JOINT[Joint Limits]
FOOT[Foot Contact]
end
AE --> DEC
IE --> DEC
DEC --> MSE
DEC --> EE
DEC --> BONE
DEC --> SMOOTH
DEC --> VEL
DEC --> JOINT
DEC --> FOOT
style AE fill:#6366f1,stroke:#6366f1,color:#fff
style IE fill:#10b981,stroke:#10b981,color:#fff
style DEC fill:#f59e0b,stroke:#f59e0b,color:#000
| Loss | Stage 1 | Stage 2 | Stage 3 | Purpose |
|---|---|---|---|---|
| $\lambda_{\text{AR}}$ | 1.0 | 0.0 | 0.5 | Action recognition auxiliary |
| $\lambda_{\text{RI}}$ | 1.0 | 0.0 | 0.5 | Re-ID minimization |
| $\lambda_{\text{NCE}}$ | 0.1 | 0.0 | 0.05 | Contrastive disentanglement |
| $\lambda_{\text{adv}}$ | 0.5 | 0.0 | 0.25 | Adversarial identity confusion |
| $\lambda_{\text{orth}}$ | 0.1 | 0.0 | 0.05 | Orthogonality constraint |
| $\lambda_{\text{MI}}$ | 0.1 | 0.0 | 0.05 | Mutual information minimization |
| $\lambda_{\text{MSE}}$ | 0.0 | 1.0 | 1.0 | Reconstruction fidelity |
| $\lambda_{\text{bone}}$ | 0.0 | 0.5 | 0.5 | Bone length consistency |
| $\lambda_{\text{smooth}}$ | 0.0 | 0.1 | 0.1 | Temporal smoothness |
| $\lambda_{\text{vel}}$ | 0.0 | 0.1 | 0.1 | Velocity distribution matching |
configs/main_config.yaml
A comprehensive set of losses organized into three categories: disentanglement, reconstruction, and physical plausibility constraints.
Ensures action embedding preserves action class information via cross-entropy classification.
where $\hat{y} = \text{softmax}(\text{MLP}(\mathbf{H}_{\text{action}}))$
Minimizes identity information in action embedding by maximizing classification entropy.
Pushes identity predictions toward a uniform distribution
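These two objectives pull in opposite directions on the same embedding: cross-entropy keeps action classes separable, while an entropy term pushes identity predictions toward uniform. A minimal numpy sketch (the classifier heads and function names are illustrative, not the repo's API):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ar_loss(action_logits, labels):
    """Cross-entropy: the action embedding must stay classifiable."""
    p = softmax(action_logits)
    return -np.log(p[np.arange(len(labels)), labels]).mean()

def ri_loss(identity_logits):
    """Negative entropy: minimizing this maximizes the entropy of
    identity predictions, pushing them toward uniform."""
    p = softmax(identity_logits)
    return (p * np.log(p + 1e-12)).sum(axis=-1).mean()

# Uniform identity logits reach the entropy bound -log(N).
uniform = ri_loss(np.zeros((2, 4)))
```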
Pulls same-action embeddings together, pushes different-action embeddings apart.
Temperature $\tau = 0.07$, cosine similarity
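The contrastive term is an InfoNCE-style objective over cosine similarities with $\tau = 0.07$. A self-contained numpy sketch assuming one positive per anchor (the batch-construction details are assumptions):

```python
import numpy as np

def info_nce(anchors, positives, tau=0.07):
    """InfoNCE over cosine similarity: anchors[i] and positives[i]
    share an action class; every other positive acts as a negative."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sim = a @ p.T / tau                       # (B, B) similarity logits
    sim = sim - sim.max(axis=1, keepdims=True)
    logits = np.exp(sim)
    # Diagonal entries correspond to the positive pairs.
    loss = -np.log(np.diag(logits) / logits.sum(axis=1))
    return loss.mean()

# Aligned pairs (low loss) versus mismatched pairs (high loss).
z = np.eye(3)
low, high = info_nce(z, z), info_nce(z, np.roll(z, 1, axis=0))
```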
Gradient Reversal Layer confuses identity discriminator during backprop.
The GRL reverses gradients during the backward pass: $\frac{\partial \mathcal{L}}{\partial \theta} \mapsto -\lambda \frac{\partial \mathcal{L}}{\partial \theta}$
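The GRL is an identity map in the forward pass and multiplies gradients by $-\lambda$ in the backward pass, so the encoder is updated to *hurt* the identity discriminator. A framework-free illustration of that sign flip on a scalar chain (a real implementation would hook into autograd via a custom backward function):

```python
import numpy as np

LAMBDA = 0.5  # reversal strength (illustrative value)

def grl_forward(x):
    return x  # identity in the forward pass

def grl_backward(grad_output):
    return -LAMBDA * grad_output  # reversed, scaled gradient

# Toy chain: encoder output x -> GRL -> discriminator loss 0.5 * y^2.
x = np.array(2.0)
y = grl_forward(x)
dloss_dy = y                       # d(0.5 * y^2) / dy
dloss_dx = grl_backward(dloss_dy)  # gradient reaching the encoder
# Without the GRL the encoder would receive +2.0; with it, -1.0,
# pushing the encoder to *increase* the discriminator's loss.
```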
Enforces orthogonality between action and identity embedding spaces.
Minimizes absolute cosine similarity toward 0
Minimizes statistical dependence via cross-correlation matrix.
Off-diagonal elements of cross-correlation should be zero
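Both constraints fit in a few lines: the orthogonality term drives the absolute cosine similarity between paired embeddings toward 0, and the correlation term penalizes off-diagonal entries of the batch cross-correlation matrix (a Barlow Twins-style construction). A numpy sketch, assuming both embeddings are projected to the same dimension:

```python
import numpy as np

def orthogonality_loss(h_action, h_identity):
    """Mean |cosine similarity| between paired embeddings."""
    a = h_action / np.linalg.norm(h_action, axis=1, keepdims=True)
    b = h_identity / np.linalg.norm(h_identity, axis=1, keepdims=True)
    return np.abs((a * b).sum(axis=1)).mean()

def cross_correlation_loss(h_action, h_identity, eps=1e-8):
    """Squared off-diagonal mass of the standardized cross-correlation
    (assumes equal embedding dimensions, e.g. after projection)."""
    a = (h_action - h_action.mean(0)) / (h_action.std(0) + eps)
    b = (h_identity - h_identity.mean(0)) / (h_identity.std(0) + eps)
    c = a.T @ b / len(a)               # (d, d) correlation matrix
    off = c - np.diag(np.diag(c))
    return (off ** 2).sum()

# Perfectly orthogonal embeddings incur zero orthogonality penalty.
h_act = np.tile(np.array([1.0, 0.0]), (4, 1))
h_id = np.tile(np.array([0.0, 1.0]), (4, 1))
```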
src/training/disentanglement_losses.py
Mean squared error between predicted and ground truth joint positions.
Higher weight on hands, feet, and head for perceptually important joints.
$\mathcal{V}_{\text{ee}} = \{\text{hands, feet, head}\}$
Penalizes jitter by minimizing second-order temporal differences.
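The reconstruction and smoothness terms are straightforward to write down. A numpy sketch with an illustrative per-joint weight vector (the actual end-effector weights are a configuration choice):

```python
import numpy as np

def weighted_mse(pred, target, joint_weights):
    """Per-joint weighted MSE; end-effectors get higher weights."""
    err = ((pred - target) ** 2).sum(axis=-1)   # (T, J) squared error
    return (err * joint_weights).mean()

def smoothness_loss(pred):
    """Penalize jitter via second-order temporal differences."""
    accel = pred[2:] - 2 * pred[1:-1] + pred[:-2]
    return (accel ** 2).mean()

# Linear (constant-velocity) motion has zero second differences,
# so it incurs no smoothness penalty.
t = np.linspace(0, 1, 8).reshape(8, 1, 1)
motion = np.tile(t, (1, 3, 3))
```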
Fréchet distance between velocity distributions of generated and real motions.
src/training/loss.py
Enforces constant bone lengths across all frames.
Penalizes anatomically impossible joint angles.
Reduces foot sliding during ground contact phases.
Conservation of angular momentum for realistic dynamics.
Minimizes acceleration magnitude for natural motion.
Maximum Mean Discrepancy between velocity distributions.
src/losses/physical_plausibility.py
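Two of the physical terms, sketched in numpy: bone-length consistency penalizes each bone's length variance across frames, and the foot-contact loss penalizes foot velocity on frames flagged as in contact (the contact detector itself is assumed given; function names are illustrative):

```python
import numpy as np

def bone_length_loss(motion, parents):
    """Variance of each bone's length across frames (should be ~0)."""
    lengths = np.stack([np.linalg.norm(motion[:, j] - motion[:, p], axis=-1)
                        for j, p in enumerate(parents) if p >= 0], axis=1)
    return lengths.var(axis=0).mean()           # (T, B) -> scalar

def foot_contact_loss(foot_positions, contact_mask):
    """Mean squared foot velocity on contact frames (no sliding)."""
    vel = np.diff(foot_positions, axis=0)       # (T-1, 3)
    mask = contact_mask[1:].astype(float)
    return ((vel ** 2).sum(axis=-1) * mask).sum() / max(mask.sum(), 1.0)

# A rigid 3-joint chain translating uniformly: bone lengths never change.
base = np.array([[0.0, 0, 0], [1, 0, 0], [2, 0, 0]])
motion = base[None] + np.arange(5)[:, None, None] * 0.1
```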
Comprehensive evaluation across three dimensions: privacy protection, action utility preservation, and physical plausibility of generated motions.
Accuracy of a classifier trained to identify source subject from action embedding. Lower is better (target: random chance = 1/N).
Accuracy of adversarial discriminator trying to predict identity. Lower is better (target: 50% = random).
Accuracy of action classifier on action embeddings. Higher is better.
Mean squared error between retargeted and ground truth motion. Lower is better.
Variance of bone lengths across frames. Should be near zero.
Percentage of frames with anatomically valid joint angles.
Average acceleration magnitude (lower = smoother).
MMD between generated and real velocity distributions.
Correlation between detected contact and low foot velocity.
Average foot velocity during detected contact frames.
| Category | Metric | Direction | Target | Interpretation |
|---|---|---|---|---|
| Privacy | RI Accuracy | ↓ Lower | ≈ 1/N | Random chance = perfect privacy |
| Privacy | Disc. Accuracy | ↓ Lower | ≈ 50% | Discriminator confused |
| Utility | AR Accuracy | ↑ Higher | > 85% | Action preserved |
| Utility | MSE | ↓ Lower | < 0.01 | High reconstruction quality |
| Physical | BLC | ↓ Lower | ≈ 0 | Constant bone lengths |
| Physical | JAL | ↑ Higher | > 95% | Valid joint angles |
| Physical | TS | ↓ Lower | < 0.1 | Smooth motion |
| Physical | FSE | ↓ Lower | < 0.01 | No foot sliding |
Two categories of systematic evaluation: Encoder-Side Ablations remove components from the baseline, while Transformer Tricks Ablations swap in entirely new tokenization front-ends.
flowchart TB
subgraph Baseline
B_IN[Input] --> B_TC[Conv]
B_TC --> B_LSTM[LSTM]
B_LSTM --> B_ATT[Attention]
B_ATT --> B_OUT[Output]
end
subgraph EncoderAblations
E1[No Conv]
E2[No LSTM]
E3[Full-Seq Identity]
end
subgraph TransformerTricks
T1[Position Tokens]
T2[Dynamics Tokens]
T3[Tokens + Codebook]
end
Baseline --> EncoderAblations
Baseline --> TransformerTricks
style B_TC fill:#6366f1,stroke:#6366f1,color:#fff
style B_LSTM fill:#06b6d4,stroke:#06b6d4,color:#fff
style B_ATT fill:#8b5cf6,stroke:#8b5cf6,color:#fff
flowchart LR
subgraph ActionEncoder
IN[Source Motion] --> TC[Conv]
TC --> LSTM[BiLSTM]
LSTM --> ATT[Self-Attn]
ATT --> HA[Action Emb]
end
subgraph IdentityEncoder
TGT[Target Skeleton] --> MEAN[Mean]
MEAN --> GCN[GCN]
GCN --> HI[Identity Emb]
end
subgraph Decoder
HA --> XATTN[Cross-Attn]
HI --> XATTN
XATTN --> OUT[Output]
end
style TC fill:#6366f1,stroke:#6366f1,color:#fff
style LSTM fill:#06b6d4,stroke:#06b6d4,color:#fff
style ATT fill:#8b5cf6,stroke:#8b5cf6,color:#fff
style GCN fill:#10b981,stroke:#10b981,color:#fff
style XATTN fill:#f59e0b,stroke:#f59e0b,color:#000
Removes multi-scale temporal convolutions. Input goes directly to LSTM.
flowchart LR
IN[Input] --> X[SKIP]
X --> LSTM[LSTM]
LSTM --> ATT[Attention]
ATT --> OUT[Output]
style X fill:#ef4444,stroke:#ef4444,color:#fff
style LSTM fill:#06b6d4,stroke:#06b6d4,color:#fff
style ATT fill:#8b5cf6,stroke:#8b5cf6,color:#fff
Removes bidirectional LSTM. Conv output goes directly to attention.
flowchart LR
IN[Input] --> TC[Conv]
TC --> X[SKIP]
X --> ATT[Attention]
ATT --> OUT[Output]
style TC fill:#6366f1,stroke:#6366f1,color:#fff
style X fill:#ef4444,stroke:#ef4444,color:#fff
style ATT fill:#8b5cf6,stroke:#8b5cf6,color:#fff
Identity encoder uses full sequence instead of temporal mean (static pose).
flowchart LR
IN[Full Sequence] --> X[No Mean]
X --> GCN[GCN]
GCN --> OUT[Output]
style X fill:#ef4444,stroke:#ef4444,color:#fff
style GCN fill:#10b981,stroke:#10b981,color:#fff
Flattens raw positions per frame into tokens. Bypasses vel/acc computation and conv layers.
flowchart LR
IN[Positions] --> TOK[Position Tokenizer]
TOK --> ATT[Self-Attn]
ATT --> LSTM[LSTM]
LSTM --> GATE[Gate]
GATE --> OUT[Output]
style TOK fill:#ec4899,stroke:#ec4899,color:#fff
style ATT fill:#8b5cf6,stroke:#8b5cf6,color:#fff
style LSTM fill:#06b6d4,stroke:#06b6d4,color:#fff
style GATE fill:#10b981,stroke:#10b981,color:#fff
Tokenizes pos + vel + acc + bone lengths per frame. Bypasses conv layers.
flowchart LR
IN[Positions] --> TOK[Dynamics Tokenizer]
TOK --> ATT[Self-Attn]
ATT --> LSTM[LSTM]
LSTM --> GATE[Gate]
GATE --> OUT[Output]
style TOK fill:#ec4899,stroke:#ec4899,color:#fff
style ATT fill:#8b5cf6,stroke:#8b5cf6,color:#fff
style LSTM fill:#06b6d4,stroke:#06b6d4,color:#fff
style GATE fill:#10b981,stroke:#10b981,color:#fff
Adds VQ-VAE style codebook to discretize dynamics tokens. May improve privacy via quantization.
flowchart LR
IN[Positions] --> TOK[Dynamics Tokenizer]
TOK --> VQ[VQ Codebook]
VQ --> ATT[Self-Attn]
ATT --> LSTM[LSTM]
LSTM --> GATE[Gate]
GATE --> OUT[Output]
style TOK fill:#ec4899,stroke:#ec4899,color:#fff
style VQ fill:#f59e0b,stroke:#f59e0b,color:#000
style ATT fill:#8b5cf6,stroke:#8b5cf6,color:#fff
style LSTM fill:#06b6d4,stroke:#06b6d4,color:#fff
style GATE fill:#10b981,stroke:#10b981,color:#fff
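The codebook variant snaps each dynamics token to its nearest codebook entry, so fine-grained (potentially identity-revealing) variation collapses onto shared prototypes. A minimal nearest-neighbor quantization sketch (a full VQ-VAE adds commitment losses and a straight-through gradient estimator, omitted here):

```python
import numpy as np

def quantize(tokens, codebook):
    """Snap each token to its nearest codebook vector (L2 distance)."""
    # (N, 1, D) - (1, K, D) -> (N, K) squared distances
    d2 = ((tokens[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx

# Two codewords; each token maps to its nearest prototype.
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
tokens = np.array([[0.1, -0.1], [0.9, 1.2]])
quantized, idx = quantize(tokens, codebook)
```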
| Category | Ablation | AR (↑) | RI (↓) | MSE (↓) | Physical |
|---|---|---|---|---|---|
| Baseline | Full Model | Reference | Reference | Reference | Reference |
| Encoder-Side | No Temporal Conv | ↓ Worse | → Similar | ↓ Worse | ↓ Worse |
| Encoder-Side | No LSTM | ↓ Worse | → Similar | ↓ Worse | ↓ Slightly |
| Encoder-Side | Identity Full-Seq | → Similar | ↓ Worse (leak) | → Similar | → Similar |
| Transformer Tricks | Token Position | ? Unknown | → Similar | ? Unknown | → Similar |
| Transformer Tricks | Token Dynamics | → Similar? | ↑ Better? | ? Unknown | → Similar |
| Transformer Tricks | Token + Codebook | → Similar? | ↑ Better? | ? Unknown | → Similar |