Baidu AI researchers present SE-MoE, which offers elastic MoE training with 2D prefetch and fusion communication over hierarchical storage

Source: https://arxiv.org/pdf/2205.10034v1.pdf

Machine learning and deep learning have gained popularity in areas such as computer vision (CV) and natural language processing (NLP), which require the analysis of large amounts of data such as images and text, and therefore demand substantial computational resources. To address this, sparsely activated neural networks based on Mixture-of-Experts (MoE) have been used to train much larger models with little or no additional computation while achieving better training results.
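To make the idea of sparse activation concrete, here is a minimal MoE layer with top-1 gating, written in PyTorch purely as an illustration of the general technique; it is not the SE-MoE implementation described in the paper, and the layer sizes are arbitrary.

```python
# Minimal sketch of a sparsely activated MoE layer with top-1 gating.
# Each token is routed to a single expert, so only a fraction of the
# parameters is active per token. Illustration only, not SE-MoE's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)  # router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)   # (tokens, num_experts)
        top_prob, top_idx = scores.max(dim=-1)     # hard top-1 routing
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                    # tokens assigned to expert e
            if mask.any():
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out

# Example: 16 tokens, model width 64, 4 experts
layer = SimpleMoE(d_model=64, d_hidden=256, num_experts=4)
y = layer(torch.randn(16, 64))
```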

Despite these advantages, MoE models still face several challenges, described below.

  1. Computational challenges: Ineffective expert selection by the router makes MoE training less efficient. Remedies such as auxiliary losses and stochastic experts are used to avoid this, but they shift effort from computation to routing and scheduling, which puts more pressure on CPUs than on GPUs.
  2. Communication challenges: Which parameters are activated in an MoE model depends closely on the input data. When the data is unbalanced, this causes load imbalance even if the routing method itself is efficient. In multi-device training that relies on synchronous communication, the imbalance makes devices progress at different paces, so they end up waiting for one another and performance degrades (a standard auxiliary balancing loss is sketched after this list).
  3. Storage limitations: The memory available on computing devices severely limits the size of MoE models. For densely activated models, performance is largely a trade-off between training time and the memory required. Different storage tiers can all hold the same parameters, but their I/O latency varies widely, so parameters incur different access latencies depending on where they reside. The challenge is therefore to build a unified and efficient storage-management system that supports sparsely activated networks.
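The auxiliary loss mentioned above can be made concrete with the standard Switch-Transformer-style load-balancing term, which penalizes routers that send most tokens to a few experts. This is a generic illustration of the technique, not SE-MoE's own balancing strategy.

```python
# Illustrative auxiliary load-balancing loss (Switch-Transformer style).
# It is minimized when tokens are spread uniformly over experts, mitigating
# the routing and load-imbalance issues described above.
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor) -> torch.Tensor:
    """router_logits: (tokens, num_experts) raw gate outputs."""
    num_experts = router_logits.size(-1)
    probs = F.softmax(router_logits, dim=-1)      # soft routing probabilities
    top1 = probs.argmax(dim=-1)                   # hard top-1 assignment
    # f_e: fraction of tokens dispatched to expert e
    dispatch_frac = F.one_hot(top1, num_experts).float().mean(dim=0)
    # p_e: mean routing probability assigned to expert e
    prob_frac = probs.mean(dim=0)
    return num_experts * torch.sum(dispatch_frac * prob_frac)

loss = load_balancing_loss(torch.randn(1024, 8))  # 1024 tokens, 8 experts
```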

Accordingly, to overcome these challenges, the paper proposes a unified framework for MoE training and inference. Its main contribution is SE-MoE, a distributed system that can scale MoE models to trillions of parameters and fully exploit a cluster's storage hierarchy, including GPU high-bandwidth memory, CPU memory, and SSDs, to achieve efficient training scheduling. For inference, SE-MoE uses a ring-memory approach under dynamic graph scheduling to overlap computation and communication as much as possible, yielding better inference performance for larger MoE models without requiring additional machines. Moreover, SE-MoE applies methods such as load balancing to improve performance without any additional resources.
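The following sketch illustrates the general idea of overlapping parameter prefetch with computation across a storage hierarchy (SSD to CPU memory to GPU HBM). The tier names and helper functions are assumptions made for illustration; this is not SE-MoE's actual API or its 2D prefetch scheduler.

```python
# Conceptual sketch: prefetch the next layer's expert parameters from slower
# storage while the current layer computes, hiding I/O latency.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def load_from_ssd(layer_id: int) -> np.ndarray:
    """Stand-in for reading one layer's expert parameters from SSD into CPU memory."""
    return np.random.randn(1024, 1024).astype(np.float32)

def copy_to_device(host_params: np.ndarray) -> np.ndarray:
    """Stand-in for a host-to-GPU copy (on a real system, an async H2D transfer)."""
    return host_params

def compute(layer_id: int, params: np.ndarray) -> None:
    _ = params @ params.T  # placeholder for the layer's forward computation

num_layers = 4
with ThreadPoolExecutor(max_workers=1) as prefetcher:
    future = prefetcher.submit(load_from_ssd, 0)      # warm up layer 0
    for layer_id in range(num_layers):
        params = copy_to_device(future.result())      # wait for current layer's weights
        if layer_id + 1 < num_layers:
            future = prefetcher.submit(load_from_ssd, layer_id + 1)  # prefetch next layer
        compute(layer_id, params)                      # overlaps with the prefetch above
```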

The MoE training architecture is shown in Figure 1.

The experiments are divided into two parts: evaluation of training efficiency and of inference performance. The results show that SE-MoE outperforms the DeepSpeed MoE system, achieving a speedup of nearly 28% in single-node training and of at least 33% in multi-node training for MoE models with more than 100 billion parameters. Additionally, SE-MoE reduces the GPU memory usage of each rank by approximately 12 GB. For inference on MoE models with over 200 billion parameters, SE-MoE achieves a speedup of nearly 13% compared to DeepSpeed.

In addition, experiments were performed to evaluate elastic MoE training and to verify the effect of embedding partition on the MoE architecture. The results show that applying the embedding-partition method on a single node effectively reduces GPU memory usage: the proposed approach cuts GPU memory by 22.4%, 24.2%, and 26.3% while improving throughput by 4.2%, 11.2%, and 15.6%, respectively, as the hidden size increases.
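As a rough, single-process illustration of why partitioning an embedding table saves memory, the sketch below splits the table along the hidden dimension so that each rank would store only one slice; the shard layout and the concatenation step stand in for an all-gather and are assumptions about the general technique, not the paper's exact embedding-partition scheme.

```python
# Each rank keeps only a slice of the embedding table, cutting its memory
# footprint by roughly 1/world_size; partial lookups are combined afterwards.
import torch

vocab_size, hidden_size, world_size = 50_000, 4096, 4
full_table = torch.randn(vocab_size, hidden_size)

# One shard per rank, split along the hidden dimension.
shards = torch.chunk(full_table, world_size, dim=1)   # each: (vocab, hidden/world)

token_ids = torch.randint(0, vocab_size, (8,))
# Per-rank partial lookup; in a real distributed run the partial outputs
# would be combined with an all-gather along the hidden dimension.
partials = [shard[token_ids] for shard in shards]
embeddings = torch.cat(partials, dim=1)                # (8, hidden_size)
assert torch.equal(embeddings, full_table[token_ids])  # matches the full lookup
```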

In summary, the article presents SE-MoE, an MoE training and inference system that can serve both the NLP and CV domains. The study can be extended toward a unified training and inference system that accounts for parameter sparsity and scheduling across multiple dimensions, so that the unified system can overcome the communication, computation, and storage limitations of sparse training.

This article is written as a summary by Marktechpost Staff based on the research paper 'SE-MoE: A Scalable and Efficient Mixture-of-Experts Distributed Training and Inference System'. All credit for this research goes to the researchers of this project. Check out the paper and GitHub.


James G. Williams