When You Wear It in a Different Spot: Symmetry as an Inductive Bias
A wristband doesn't know which way you put it on. CycloFormer doesn't either, by construction.
A wrist-worn EMG band is a ring of 16 sensors that read the electrical signature of forearm muscles. Make the same gesture in two different sessions, and your model should produce the same prediction: same gesture, same finger angles, same everything. It rarely does.
The reason is mundane. People rotate the band when they put it on. Channel 0 might be the electrode over your flexor digitorum superficialis today, and channel 7 tomorrow. The underlying physiology is the same; the indexing of the signal is not.
The architectures in the original emg2pose benchmark deal with this by augmentation: rotate the channel axis randomly during training, hope the network learns to be robust. It mostly works, in the sense that test loss goes down. It's also fragile in the way augmentation is always fragile — you're asking the network to learn a symmetry you could have given it for free.
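For concreteness, the augmentation described above amounts to a random cyclic shift along the channel axis. The following is a minimal NumPy sketch, not the benchmark's code: the function name `random_channel_rotation` and the 16 x 200 window shape are assumptions made for illustration.

```python
import numpy as np

def random_channel_rotation(x, rng, channel_axis=0):
    """Cyclically shift the channel axis by a random offset k in [0, n_channels)."""
    n_channels = x.shape[channel_axis]
    k = int(rng.integers(n_channels))
    return np.roll(x, shift=k, axis=channel_axis), k

rng = np.random.default_rng(0)
# Hypothetical EMG window: 16 channels x 200 time samples.
window = rng.standard_normal((16, 200))
rotated, k = random_channel_rotation(window, rng)

# np.roll with shift k moves channel i to index (i + k) % 16,
# so rolling back by -k recovers the original window.
restored = np.roll(rotated, shift=-k, axis=0)
```

Nothing about the network changes here. The symmetry only shows up in the data the network sees, which is why it has to be learned rather than assumed.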
CycloFormer gives it for free. Specifically, the model is exactly ℤ₁₆-invariant: for any cyclic rotation σ_k of the 16 channels, f(σ_k · x) = f(x). Not approximately. Not after fine-tuning. By construction.
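The property is easy to test numerically for any candidate function. The helper below is a sketch, with `max_invariance_gap` and the two toy functions being names invented here. It measures the largest change in output over all 16 rotations. In floating point, "exactly invariant" means a gap at the level of rounding error, not bitwise equality.

```python
import numpy as np

def max_invariance_gap(f, x, channel_axis=0):
    """Largest |f(sigma_k x) - f(x)| over every cyclic rotation k of the channels."""
    n = x.shape[channel_axis]
    base = f(x)
    return max(
        float(np.max(np.abs(f(np.roll(x, k, axis=channel_axis)) - base)))
        for k in range(n)
    )

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 200))   # hypothetical window: 16 channels x 200 samples
w = rng.standard_normal(16)          # one weight per channel index

def pooled(x):
    # Invariant: averaging over channels discards their order.
    return x.mean(axis=0)

def weighted(x):
    # Not invariant: each weight is tied to a fixed channel index.
    return w @ x

gap_pooled = max_invariance_gap(pooled, x)      # rounding-level
gap_weighted = max_invariance_gap(weighted, x)  # clearly nonzero
```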
three pieces
The invariance comes from composing three pieces. The first two are ℤ₁₆-equivariant in isolation; the third is invariant:
- A channel-shared TDS-CNN front end. The same convolutional weights process every electrode channel independently. Identical weights ⟹ the operation commutes with any channel permutation, including cyclic rotation.
- Circular rotary position embeddings (CRoPE) on the spatial axis. Standard RoPE rotates queries and keys by their absolute position, so attention scores depend on a linear offset along a sequence that has a first and a last token. CRoPE encodes the cyclic offset on the ring instead. Attention scores end up depending only on the relative offset between two channels modulo 16, which is exactly the structure of ℤ₁₆.
- A permutation-invariant attention pool over the channel axis. Instead of concatenating or flattening the 16 channel tokens, we attend to them with a single learned query and take the softmax-weighted sum. Each token's weight depends only on that token's content, so permuting the tokens only reorders the terms of the sum, and reordering the terms of a sum doesn't change it.
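The three pieces above can be sketched in a few dozen lines of NumPy. This is an illustration under stated assumptions, not the CycloFormer implementation. A shared linear map stands in for the TDS-CNN front end. The ring embedding uses integer frequencies so that rotation angles are periodic in the channel index, which is one way to make scores depend only on the offset modulo 16; the actual CRoPE parameterization may differ. All sizes and names (`front_end`, `ring_rotate`, `ring_attention`, `attention_pool`) are hypothetical.

```python
import numpy as np

C, T, D = 16, 64, 8   # channels, samples per window, token width (illustrative sizes)
rng = np.random.default_rng(0)
W_in = rng.standard_normal((T, D)) / np.sqrt(T)   # one weight matrix shared by all channels
W_q, W_k, W_v = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
q_pool = rng.standard_normal(D)                   # single pooling query
freqs = np.arange(1, D // 2 + 1)                  # ASSUMPTION: integer frequencies

def softmax(s, axis=-1):
    e = np.exp(s - s.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def front_end(x):
    """Stand-in for the channel-shared front end: same weights applied to every row."""
    return np.tanh(x @ W_in)                      # (C, T) -> (C, D)

def ring_rotate(u):
    """Rotate feature pairs by angle 2*pi*m*c/C, periodic in the channel index c."""
    ang = 2 * np.pi * np.outer(np.arange(C), freqs) / C   # (C, D/2)
    cos, sin = np.cos(ang), np.sin(ang)
    out = np.empty_like(u)
    out[:, 0::2] = u[:, 0::2] * cos - u[:, 1::2] * sin
    out[:, 1::2] = u[:, 0::2] * sin + u[:, 1::2] * cos
    return out

def ring_attention(h):
    """Self-attention over channels; scores depend only on (i - j) mod C."""
    q, k, v = ring_rotate(h @ W_q), ring_rotate(h @ W_k), h @ W_v
    return softmax(q @ k.T / np.sqrt(D)) @ v      # (C, D)

def attention_pool(h):
    """Softmax-weighted sum over channels with one query; row order is irrelevant."""
    w = softmax(h @ q_pool / np.sqrt(D))          # (C,)
    return w @ h                                  # (D,)

def model(x):
    return attention_pool(ring_attention(front_end(x)))

x = rng.standard_normal((C, T))

# Equivariance of the first two stages: rotating the input rotates the output.
equiv_gap = max(
    np.abs(ring_attention(front_end(np.roll(x, k, axis=0)))
           - np.roll(ring_attention(front_end(x)), k, axis=0)).max()
    for k in range(C)
)
# Invariance of the full stack: rotating the input leaves the output unchanged.
inv_gap = max(np.abs(model(np.roll(x, k, axis=0)) - model(x)).max() for k in range(C))

# Swapping two channels is a permutation but not a rotation, so nothing is guaranteed.
swapped = x.copy()
swapped[[0, 5]] = swapped[[5, 0]]
swap_gap = np.abs(model(swapped) - model(x)).max()
```

With random weights, `equiv_gap` and `inv_gap` sit at rounding level while `swap_gap` does not. The guarantee covers ℤ₁₆ specifically, not arbitrary permutations, because the ring embedding ties attention to where channels sit relative to one another.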
Compose them in that order (invariant ∘ equivariant ∘ equivariant, reading right to left) and you get an invariant network. The proof is one paragraph. The implementation is one file.
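Written out, the argument looks like this. The symbols F (front end), A (CRoPE attention block), and P (attention pool) are shorthand introduced here, not names from the codebase.

```latex
Let $\sigma_k$ act on channel-indexed tensors by cyclic shift, and suppose
\[
  F(\sigma_k \cdot x) = \sigma_k \cdot F(x), \qquad
  A(\sigma_k \cdot h) = \sigma_k \cdot A(h), \qquad
  P(\sigma_k \cdot h) = P(h)
\]
for all $k \in \mathbb{Z}_{16}$. Then for $f = P \circ A \circ F$,
\begin{align*}
  f(\sigma_k \cdot x)
    &= P\bigl(A(F(\sigma_k \cdot x))\bigr) \\
    &= P\bigl(A(\sigma_k \cdot F(x))\bigr)  && \text{($F$ equivariant)} \\
    &= P\bigl(\sigma_k \cdot A(F(x))\bigr)  && \text{($A$ equivariant)} \\
    &= P\bigl(A(F(x))\bigr)                 && \text{($P$ invariant)} \\
    &= f(x).
\end{align*}
```

Each step pushes the rotation one layer deeper until the pool absorbs it. Any equivariant layer inserted between F and P preserves the argument.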
does the bias help?
In the ablations, yes — and more so where the model has to extrapolate. On the easy User split, each of the channel-shared CNN, CRoPE, and attention pool contributes less than 0.2°. On the harder Stage split, the same three components together contribute about +5°. The model has to use the symmetry, not just memorize through it.
A matched-parameter baseline that uses cyclic channel-rotation augmentation instead of architectural invariance does not close the gap. In our setup, augmentation alone slightly degrades performance — probably because forcing the network to memorize an extra invariance burns capacity it would rather spend on the task.
why this is the right place to use it
Group-equivariant architectures are everywhere — SE(3)-Transformers for point clouds, E(2)-CNNs for images, ℤ_n-CNNs for crystallography. Most of them target perception modalities where the symmetry is geometric and obvious. Biosignals are different: the symmetry isn't in the world the model sees, it's in how the world is sampled. The sensor array imposes its own group structure, and getting that structure right is roughly free in parameter count.
That makes it a good test case. If a hand-designed inductive bias buys you anything at scale, it should buy you the most where data is scarce and where the symmetry is exact. EMG is both. CycloFormer is what you get when you take that seriously.