Programming board - MLPerf 1.0 really is HPC; now it has hooked up with SC21
m*****p
Posts: 39
https://www.hpcwire.com/2021/06/30/latest-mlperf-results-nvidia-shines-but-intel-graphcore-google-increase-their-presence/
If you don't have a few thousand GPUs, you'd almost be embarrassed to submit results at all. Nvidia's 4096 A100s really deliver: a model trained in a few minutes, on the same order of magnitude as Google AI with its several thousand TPUv4 chips, and far ahead of every other company.
Nvidia:
On the system side, the key building block of our at-scale training is the
NVIDIA DGX SuperPOD. DGX SuperPOD is the culmination of years of expertise
in HPC and AI data centers. It is based on the NVIDIA DGX A100 with the
latest NVIDIA A100 Tensor Core GPU, third-generation NVIDIA NVLink, NVSwitch,
and the NVIDIA ConnectX-6 VPI 200 Gbps HDR InfiniBand. These were combined
to make Selene a top 5 supercomputer in the Top 500 supercomputer list,
with the following components:
4480 NVIDIA A100 Tensor Core GPUs
560 NVIDIA DGX A100 systems
850 Mellanox 200G HDR InfiniBand switches
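
A quick sanity check on those component counts (my own note, not from the article): each DGX A100 system carries 8 A100 GPUs, so the GPU total follows directly from the system count. A minimal Python sketch:

# Back-of-the-envelope check of the Selene numbers quoted above.
dgx_systems = 560        # NVIDIA DGX A100 systems
gpus_per_dgx = 8         # each DGX A100 holds 8 A100 GPUs
print(dgx_systems * gpus_per_dgx)   # 4480, matching the quoted A100 count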
On the software side, the NGC container release v. 21.05 enhances and
enables several capabilities:
Distributed optimizer support enhancement.
Improved communication efficiency with Mellanox HDR Infiniband and NCCL 2.9.9.
Added SHARP support. SHARP improves upon the performance of MPI and machine
learning collective operations. SHARP support was added to NCCL to offload
all-reduce collective operations into the network fabric, reducing the
amount of data traversing between nodes.
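
To make the SHARP point concrete: the collective being offloaded is an ordinary all-reduce, the operation data-parallel training uses to sum gradients across GPUs. Below is a minimal PyTorch sketch of such an NCCL-backed all-reduce (my own illustration, assuming a torchrun launch; whether SHARP in-network aggregation is actually used is decided by the NCCL/InfiniBand configuration, not by this application code):

# Minimal sketch: an all-reduce over the NCCL backend, i.e. the collective
# that SHARP can offload into the InfiniBand fabric.
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")            # assumes torchrun set the env vars
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

grad = torch.ones(1024, device="cuda") * dist.get_rank()
dist.all_reduce(grad, op=dist.ReduceOp.SUM)        # summed across all ranks
print(dist.get_rank(), grad[0].item())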
Google:
We achieved this by scaling up to 3,456 of our next-gen TPU v4 ASICs with
hundreds of CPU hosts for the multiple benchmarks. We achieved an average of
1.7x improvement in our top-line submissions compared to last year’s
results. This means we can now train some of the most common machine
learning models in a matter of seconds.
We achieved these performance improvements through continued investment in
both our hardware and software stacks. Part of the speedup comes from using
Google’s fourth-generation TPU ASIC, which offers a significant boost in
raw processing power over the previous generation, TPU v3. 4,096 of these
TPU v4 chips are networked together to create a TPU v4 Pod, with each pod
delivering 1.1 exaflop/s of peak performance.
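
As a rough cross-check of that pod-level figure (my own note, assuming the commonly cited per-chip peak of about 275 teraflop/s in bfloat16 for TPU v4, which the post itself does not state):

# Rough check: per-chip peak * chips per pod ~= quoted pod peak.
chips_per_pod = 4096
tflops_per_chip = 275          # assumed bfloat16 peak per TPU v4 chip
print(chips_per_pod * tflops_per_chip / 1e6)   # ~1.13 exaflop/s vs the quoted 1.1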
In parallel, we introduced a number of new features into the XLA compiler to
improve the performance of any ML model running on TPU v4. One of these
features provides the ability to operate two (or potentially more) TPU cores
as a single logical device using a shared uniform memory access system.
This memory space unification allows the cores to easily share input and
output data - allowing for a more performant allocation of work across cores.
A second feature improves performance through a fine-grained overlap of
compute and communication. Finally, we introduced a technique to
automatically transform convolution operations such that space dimensions
are converted into additional batch dimensions. This technique improves
performance at the low batch sizes that are common at very large scales.
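
The space-to-batch idea in that last point can be pictured as a reshape: tile the spatial axes of the input and fold the tiles into the batch axis, so that even a tiny per-core batch exposes plenty of parallel work. The NumPy sketch below (my own illustration) shows the data movement only; the real XLA pass also handles halos at tile boundaries so the convolution result is unchanged.

# Illustrative space-to-batch reshaping: fold spatial tiles into the batch axis.
import numpy as np

def space_to_batch(x, tile):
    # x: (N, H, W, C) -> (N * tile * tile, H // tile, W // tile, C)
    n, h, w, c = x.shape
    assert h % tile == 0 and w % tile == 0
    x = x.reshape(n, h // tile, tile, w // tile, tile, c)
    x = x.transpose(0, 2, 4, 1, 3, 5)      # move intra-tile offsets next to batch
    return x.reshape(n * tile * tile, h // tile, w // tile, c)

x = np.zeros((2, 224, 224, 3))             # batch of 2 images
print(space_to_batch(x, tile=2).shape)     # (8, 112, 112, 3)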
Though the margin of difference in topline MLPerf benchmarks can be measured
in mere seconds, this can translate to many days worth of training time on
the state-of-the-art models that comprise billions or trillions of
parameters. To give an example, today we can train a 4 trillion parameter
dense Transformer with GSPMD on 2048 TPU cores. For context, this is over 20
times larger than the GPT-3 model published by OpenAI last year. We are
already using TPU v4 Pods extensively within Google to develop research
breakthroughs such as MUM and LaMDA, and improve our core products such as
Search, Assistant and Translate.
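
For readers who have not seen GSPMD: it is the XLA partitioner that turns per-array sharding annotations into a distributed program, and it is exposed in JAX through the sharding API. A minimal sketch (my own, runs on whatever devices are available; the mesh axis name is illustrative, not from the post):

# Minimal GSPMD-style sketch in JAX: annotate how an array is sharded and let
# the compiler partition the computation accordingly.
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

mesh = Mesh(np.array(jax.devices()), axis_names=("data",))    # 1-D device mesh
x = jnp.ones((8 * len(jax.devices()), 1024))
x = jax.device_put(x, NamedSharding(mesh, P("data", None)))   # shard the batch axis
w = jnp.ones((1024, 1024))                                    # replicated weights

@jax.jit
def layer(x, w):
    return jnp.tanh(x @ w)      # GSPMD partitions this matmul across devices

print(layer(x, w).sharding)     # output inherits the batch sharding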