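Ascend Server Stability Stress Testing
To stress-test an Ascend server for stability, we can use crontab to launch the test script on a schedule: development work during the day, stress testing at night.
Preparing the stress-test program
HCCL provides several test programs (hccl_test) that can drive all NPU cards at the same time.
1. Environment setup
- CANN installation: see the official installation documentation.
- Open MPI installation: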
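# Download the source
wget --no-check-certificate https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.5.tar.gz
# Extract the source
tar -xvf openmpi-4.1.5.tar.gz
cd openmpi-4.1.5
# Configure
./configure --disable-fortran --enable-ipv6 --prefix=/usr/local/openmpi
# Build (a bare `make -j` spawns an unlimited number of jobs, so cap it at the CPU count)
make -j"$(nproc)"
# Install
make install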
- Building hccl_test:
cd ${ASCEND_HOME_PATH}/tools/hccl_test
make MPI_HOME=/usr/local/openmpi ASCEND_DIR=${ASCEND_HOME_PATH}
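Before scheduling anything, it may be worth confirming that both builds actually produced their executables. The following is a minimal sketch, not part of CANN or hccl_test: `preflight_check` is a hypothetical helper name, and its default paths simply assume the `/usr/local/openmpi` prefix and the `${ASCEND_HOME_PATH}/tools/hccl_test` directory used in the steps above.

```shell
#!/bin/bash
# Preflight sketch: confirm that the executables the stress test depends on exist.
# Assumed defaults: Open MPI installed under /usr/local/openmpi and hccl_test
# built under ${ASCEND_HOME_PATH}/tools/hccl_test, as in the steps above.

preflight_check() {
    local mpi_home="${1:-/usr/local/openmpi}"
    local hccl_test_dir="${2:-${ASCEND_HOME_PATH:-}/tools/hccl_test}"
    local missing=0
    local exe
    for exe in "$mpi_home/bin/mpirun" "$hccl_test_dir/bin/all_reduce_test"; do
        if [ -x "$exe" ]; then
            echo "ok: $exe"
        else
            echo "missing: $exe"
            missing=1
        fi
    done
    # Non-zero return means at least one executable was not found.
    return "$missing"
}
```

After sourcing the CANN environment, something like `preflight_check && echo "ready for stress test"` would then report whether the machine is ready.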
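2. Preparing the stress-test script
- Assume the script is saved as /home/fuzz_test/fuzz_job.sh.
- Replace /path/to/cann with the actual CANN installation location. The script sources this path, so it should point to the environment setup script of your CANN installation (typically set_env.sh).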
#!/bin/bash
# Usage: fuzz_job.sh start|stop
MODE=$1
FUZZ_JOB_CANN_SET_ENV_PATH=/path/to/cann
FUZZ_JOB_MPI_EXEC_PATH=/usr/local/openmpi/bin/mpirun
FUZZ_JOB_MPI_LIB_PATH=/usr/local/openmpi/lib
if [[ "$MODE" == "start" ]]; then
    source "$FUZZ_JOB_CANN_SET_ENV_PATH"
    export LD_LIBRARY_PATH=$FUZZ_JOB_MPI_LIB_PATH:$LD_LIBRARY_PATH
    cd "$ASCEND_HOME_PATH/tools/hccl_test" || exit 1
    echo "=============all_reduce_test started!!!============="
    echo "=====feel free to kill the process if needed :)====="
    # Run the watcher loop in the background. mpirun is called by its absolute
    # path because cron's default PATH does not include /usr/local/openmpi/bin.
    # -n 16 / -p 16 assume a 16-NPU server; adjust both to match the machine.
    nohup bash -c "
    while true; do
        $FUZZ_JOB_MPI_EXEC_PATH --allow-run-as-root -n 16 ./bin/all_reduce_test -b 1G -e 1G -f 2 -p 16 > stress.log 2>&1
        sleep 2
    done
    " > /dev/null 2>&1 &
    echo "Watcher PID: $!"
    echo $! > /tmp/fuzz_job_watcher.pid
fi
if [[ "$MODE" == "stop" ]]; then
    # Stop the watcher loop
    if [ -f /tmp/fuzz_job_watcher.pid ]; then
        kill "$(cat /tmp/fuzz_job_watcher.pid)" 2>/dev/null
        rm -f /tmp/fuzz_job_watcher.pid
    fi
    # Stop the test processes
    pkill -9 mpirun
    pkill -9 all_reduce_test
    echo "=============all_reduce_test stopped!!!============="
fi
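Because the job runs unattended at night, a quick way to check whether the watcher loop is still alive can be handy. The sketch below is an illustration rather than part of the script above: `watcher_status` is a hypothetical helper, and it only assumes the PID file location /tmp/fuzz_job_watcher.pid that fuzz_job.sh writes.

```shell
#!/bin/bash
# Sketch: report whether the watcher loop recorded in the PID file is alive.
# The default PID file path matches the one written by fuzz_job.sh.

watcher_status() {
    local pid_file="${1:-/tmp/fuzz_job_watcher.pid}"
    local pid
    if [ ! -f "$pid_file" ]; then
        echo "stopped (no PID file)"
        return 1
    fi
    pid=$(cat "$pid_file")
    # kill -0 sends no signal; it only tests whether the process exists
    # and whether we are allowed to signal it.
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        echo "running (PID $pid)"
        return 0
    fi
    echo "stopped (stale PID file)"
    return 1
}
```

Calling `watcher_status` with no arguments checks the default PID file. A stale PID file usually means the watcher was killed by something other than `fuzz_job.sh stop`.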
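3. Scheduled launch
crontab -e
# Contents: start at 00:00 and stop at 08:00 every day
0 0 * * * /home/fuzz_test/fuzz_job.sh start
0 8 * * * /home/fuzz_test/fuzz_job.sh stop
- Make sure the script is executable (chmod +x /home/fuzz_test/fuzz_job.sh), since cron runs it directly by path.
- For details, look up a crontab syntax reference.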
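If you would rather not edit the crontab interactively, the two entries can also be merged in from a script. This is a minimal sketch under the assumptions of this article (script saved at /home/fuzz_test/fuzz_job.sh, start at 00:00, stop at 08:00); `merge_cron_entries` is a hypothetical helper name. It reads an existing crontab on stdin and writes the merged result to stdout, so running it repeatedly does not duplicate the entries.

```shell
#!/bin/bash
# Sketch: merge the two schedule entries into an existing crontab without
# duplicating them. Reads the current crontab on stdin and writes the merged
# crontab to stdout. The default script path is the one assumed in this article.

merge_cron_entries() {
    local script="${1:-/home/fuzz_test/fuzz_job.sh}"
    # Drop any previous entries that mention this script, keep everything else.
    # grep exits non-zero when it prints nothing, hence the `|| true`.
    grep -vF "$script" || true
    echo "0 0 * * * $script start"
    echo "0 8 * * * $script stop"
}
```

A typical invocation would be `crontab -l 2>/dev/null | merge_cron_entries | crontab -`, which replaces the current user's crontab with the merged version. Note that any other line mentioning the script path is dropped as well.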
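References
- CANN installation: Installing CANN - CANN Community Edition 8.5.0 - Ascend Community
- hccl_test: Tool Introduction - CANN Community Edition 8.5.0 - Ascend Community
- crontab: Linux crontab Command | Runoob (菜鸟教程)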