8 articles in total
2024
[Paper Reading] MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
[Paper Reading] Fluid: Dataset Abstraction and Elastic Acceleration for Cloud-native Deep Learning Training Jobs
[Paper Reading] Gödel: Unified Large-Scale Resource Management and Scheduling at ByteDance
[Paper Reading] Not All Resources are Visible: Exploiting Fragmented Shadow Resources in Shared-State Scheduler Architecture
2023
[Paper Reading] In Search of an Understandable Consensus Algorithm
[Paper Reading] The Design of a Practical System for Fault-Tolerant Virtual Machines
[Paper Reading] The Google File System
[Paper Reading] MapReduce: Simplified Data Processing on Large Clusters