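To stress-test the stability of an Ascend server, we can use crontab to launch the test script on a schedule: development work during the day, stress testing at night.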

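Preparing the stress-test program

hccl provides several stress-test programs that can be launched on all NPU cards at the same time.

1. Environment setup

  • CANN installation: see the installation guide on the official website.
  • openmpi installation:
# Download the source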
wget --no-check-certificate https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.5.tar.gz
# Extract the source
tar -xvf openmpi-4.1.5.tar.gz
cd openmpi-4.1.5
# Configure
./configure --disable-fortran --enable-ipv6 --prefix=/usr/local/openmpi
# Build (one job per CPU core; a bare "make -j" spawns unlimited jobs and can exhaust memory)
make -j"$(nproc)"
# Install
make install
  • hccl_test build:
cd ${ASCEND_HOME_PATH}/tools/hccl_test
make MPI_HOME=/usr/local/openmpi ASCEND_DIR=${ASCEND_HOME_PATH}
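
Before moving on, it can help to confirm that both builds produced the binaries the later steps rely on. The following is a minimal sketch, not part of hccl_test or openmpi: `check_executable` and `preflight` are hypothetical helper names, and the default paths simply mirror the ones used in this article.

```shell
#!/bin/bash
# Hypothetical pre-flight helpers; names and defaults are assumptions for illustration.

# Print OK/MISSING for one path and return 0 only if it is an executable file.
check_executable() {
    local path="$1"
    if [ -x "$path" ]; then
        echo "OK: $path"
        return 0
    fi
    echo "MISSING: $path"
    return 1
}

# Check mpirun and all_reduce_test; return 1 if either one is missing.
# $1: openmpi prefix (default: /usr/local/openmpi)
# $2: hccl_test directory (default: $ASCEND_HOME_PATH/tools/hccl_test)
preflight() {
    local mpi_home="${1:-/usr/local/openmpi}"
    local hccl_test_dir="${2:-${ASCEND_HOME_PATH:-}/tools/hccl_test}"
    local status=0
    check_executable "$mpi_home/bin/mpirun" || status=1
    check_executable "$hccl_test_dir/bin/all_reduce_test" || status=1
    return "$status"
}
```

Calling `preflight` with no arguments checks the default locations; pass the two directories explicitly if openmpi or CANN was installed elsewhere.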

2. Preparing the stress-test script

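  • Assume the script is saved as /home/fuzz_test/fuzz_job.sh and is executable (chmod +x), since cron invokes it directly
  • Change /path/to/cann to the environment script of your actual CANN installation; the script below sources this path, so it must point to a file rather than a directory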
#!/bin/bash
MODE=$1
FUZZ_JOB_CANN_SET_ENV_PATH=/path/to/cann
FUZZ_JOB_MPI_EXEC_PATH=/usr/local/openmpi/bin/mpirun
FUZZ_JOB_MPI_LIB_PATH=/usr/local/openmpi/lib
if [[ "$MODE" == "start" ]]; then
    source "$FUZZ_JOB_CANN_SET_ENV_PATH"
    export LD_LIBRARY_PATH="$FUZZ_JOB_MPI_LIB_PATH:$LD_LIBRARY_PATH"
    cd "$ASCEND_HOME_PATH/tools/hccl_test" || exit 1
    echo "=============all_reduce_test started!!!============="
    echo "=====feel free to kill the process if needed :)====="
    # Run the watcher loop in the background
    nohup bash -c "
    while true; do
        $FUZZ_JOB_MPI_EXEC_PATH --allow-run-as-root -n 16 ./bin/all_reduce_test -b 1G -e 1G -f 2 -p 16 > stress.log 2>&1
        sleep 2
    done
    " > /dev/null 2>&1 &
    echo "Watcher PID: $!"
    echo $! > /tmp/fuzz_job_watcher.pid
fi
if [[ "$MODE" == "stop" ]]; then
    # Stop the watcher
    if [ -f /tmp/fuzz_job_watcher.pid ]; then
        kill $(cat /tmp/fuzz_job_watcher.pid) 2>/dev/null
        rm -f /tmp/fuzz_job_watcher.pid
    fi
    # Stop the test processes
    pkill -9 mpirun
    pkill -9 all_reduce_test
    echo "=============all_reduce_test stopped!!!============="
fi
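
To check in the morning whether the watcher is still alive, the PID file written by the script can be queried. This is a minimal sketch under the assumption that the PID file is the one used above (/tmp/fuzz_job_watcher.pid); `watcher_status` is a hypothetical helper name, not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper: report whether the process recorded in a PID file is alive.
# $1: PID file (default: the file written by fuzz_job.sh in this article)
watcher_status() {
    local pid_file="${1:-/tmp/fuzz_job_watcher.pid}"
    local pid
    if [ ! -f "$pid_file" ]; then
        echo "not running (no PID file)"
        return 1
    fi
    pid=$(cat "$pid_file")
    # kill -0 delivers no signal; it only tests that the process exists
    # and that we are allowed to signal it.
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        echo "running (PID $pid)"
        return 0
    fi
    echo "not running (stale PID file)"
    return 1
}
```

Running `watcher_status` with no arguments inspects the default PID file. A stale PID file usually means the watcher exited or was killed without going through the stop branch.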

3. Scheduled launch

crontab -e
# contents
0 0 * * * /home/fuzz_test/fuzz_job.sh start
0 8 * * * /home/fuzz_test/fuzz_job.sh stop
  • For details, look up the crontab syntax reference.
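
As a quick reading aid for the two entries above: a crontab line consists of five time fields (minute, hour, day of month, month, day of week) followed by the command, so `0 0 * * *` fires at 00:00 every day and `0 8 * * *` at 08:00 every day. The sketch below only splits an entry into labelled fields; `describe_cron_entry` is a hypothetical helper and does not validate or schedule anything.

```shell
#!/bin/bash
# Hypothetical helper: split one crontab entry into its five time fields
# and the command, and print them with labels.
describe_cron_entry() {
    local minute hour dom month dow command
    # read splits on whitespace; the last variable receives the rest of the line.
    # The here-document keeps '*' literal (no filename expansion takes place).
    read -r minute hour dom month dow command <<EOF
$1
EOF
    printf 'minute=%s hour=%s day-of-month=%s month=%s day-of-week=%s\n' \
        "$minute" "$hour" "$dom" "$month" "$dow"
    printf 'command=%s\n' "$command"
}

describe_cron_entry '0 0 * * * /home/fuzz_test/fuzz_job.sh start'
describe_cron_entry '0 8 * * * /home/fuzz_test/fuzz_job.sh stop'
```

After editing, `crontab -l` lists the installed entries so the schedule can be confirmed.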

References