Sparse Mixers: Combining MoE and Mixing to build a more efficient BERT

Bibliographic Details
Published in: arXiv.org 2022-10
Main Authors: Lee-Thorp, James, Ainslie, Joshua
Format: Article
Language: English
Description
Summary: We combine the capacity of sparsely gated Mixture-of-Experts (MoE) with the speed and stability of linear, mixing transformations to design the Sparse Mixer encoder model. Sparse Mixer slightly outperforms (<1%) BERT on GLUE and SuperGLUE, but more importantly trains 65% faster and runs inference 61% faster.
ISSN: 2331-8422
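
To make the architecture description in the summary concrete, the sketch below is a minimal, illustrative NumPy rendering of one encoder block that pairs a linear mixing sublayer with a sparsely gated top-1 MoE feed-forward sublayer. It is not the authors' implementation (the paper's actual design retains some attention sublayers and ablates several mixing mechanisms and routing configurations); the function names, shapes, parameter-free Fourier mixing, and top-1 routing here are assumptions chosen for brevity.

    import numpy as np

    def mixing_sublayer(x):
        # Parameter-free linear token mixing via a 2D Fourier transform
        # (FNet-style); keeping the real part is one common mixing choice.
        return np.fft.fft2(x).real

    def moe_ffn(x, w_gate, experts):
        # Sparsely gated MoE feed-forward: each token is routed to the
        # single expert (top-1) with the highest gate logit.
        choice = (x @ w_gate).argmax(axis=-1)      # (seq,) expert ids
        out = np.empty_like(x)
        for e, (w1, w2) in enumerate(experts):
            mask = choice == e
            h = np.maximum(x[mask] @ w1, 0.0)      # expert FFN with ReLU
            out[mask] = h @ w2
        return out

    def sparse_mixer_block(x, params):
        # One encoder block: mixing sublayer, then MoE feed-forward,
        # each wrapped in a residual connection (layer norms omitted).
        x = x + mixing_sublayer(x)
        return x + moe_ffn(x, params["w_gate"], params["experts"])

    # Toy usage with random weights.
    rng = np.random.default_rng(0)
    seq, d, d_ff, n_exp = 16, 64, 128, 4
    params = {
        "w_gate": rng.normal(size=(d, n_exp)),
        "experts": [(0.02 * rng.normal(size=(d, d_ff)),
                     0.02 * rng.normal(size=(d_ff, d)))
                    for _ in range(n_exp)],
    }
    y = sparse_mixer_block(rng.normal(size=(seq, d)), params)
    print(y.shape)  # (16, 64)

The speed and stability appeal described in the summary follows from this structure: the mixing sublayer replaces quadratic self-attention with a cheap linear transform, while top-1 routing means each token activates only one expert's parameters, so capacity grows with the number of experts without a matching growth in per-token compute.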