This article is a beginner's tutorial on the HPL and HPCG benchmarks, with a step-by-step walkthrough for newcomers. Both benchmarks are essential material in high-performance computing, and working through them is good practice for the Linux command line and for understanding basic system tools. To be safe, I recommend starting out on a virtual machine or a cloud server: if you do something wrong there, nothing is permanently damaged and you can simply spin up a new instance. This walkthrough was done on a P4 compute instance from FUNHPC. Once everything runs there, you can move on to running and tuning the benchmarks properly on your own machines or clusters.
The cover image of this article was generated by Gemini.
The overall HPL and HPCG workflow is actually very straightforward. Getting a benchmark to run involves the following steps, much like writing a C++ project and then building and running it on a machine:
graph LR
%% Node definitions
A["Prepare environment"] --> B["Get the source"]
B --> C["Configure the build"]
C --> D["Compile"]
D --> E["Tune parameters"]
E --> F["Run and analyze"]
%% Style definitions: a calm, formal palette
classDef default fill:#E3F2FD,stroke:#0277BD,stroke-width:2px,color:#01579B;
classDef highlight fill:#2C6A89,stroke:#1B3A5C,stroke-width:2px,color:#FFFFFF;
%% Apply the styles
class A,B,C,D,E default;
class F highlight;
%% Link line style
linkStyle default stroke:#0288D1,stroke-width:2px;
HPL
HPL was developed by the Innovative Computing Laboratory at the University of Tennessee and is the authoritative benchmark behind the TOP500 ranking of supercomputers. In short, HPL measures how many floating-point operations per second a machine can sustain by solving a large dense linear system $Ax = b$.
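One thing worth keeping in mind for interpreting the output later: the Gflops figure is not measured directly but derived from the problem size and the wall time, using the operation count of the LU factorization and triangular solve:

$$\text{Gflops} \;\approx\; \frac{\tfrac{2}{3}N^3 + \mathcal{O}(N^2)}{T \times 10^9},$$

where $N$ is the matrix order and $T$ the solve time in seconds. We will check this against an actual run below.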
CPU Version
We need the following core tools, each with its own role (a quick way to verify they are installed is sketched after the table):
| Tool | Purpose |
|---|---|
| gcc / g++ | C/C++ compilers; build HPL's main program framework and control logic. |
| gfortran | Fortran compiler; needed for the performance-critical numerical lower layers that HPL relies on. |
| make | Build automation tool; drives the compilers according to the Makefile to assemble the source. |
| OpenMPI | Cross-node/cross-process communication library; the "walkie-talkie" that lets CPU cores cooperate in parallel. |
| OpenBLAS | High-performance linear algebra library; the engine behind HPL's matrix operations and the main determinant of speed. |
| wget / tar | Basic utilities for downloading the source tarball and unpacking it. |
| numactl | Memory-affinity tool; keeps each CPU accessing its nearest memory to reduce latency. |
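Before launching the deployment script below, it can help to confirm the toolchain is actually present. This is just an optional sanity check under the package names used in this article, not part of the HPL workflow itself:

```bash
# Quick sanity check for the HPL toolchain (optional)
for tool in gcc gfortran make mpicc mpirun wget tar numactl; do
  command -v "$tool" >/dev/null && echo "OK: $tool" || echo "MISSING: $tool"
done
# OpenBLAS is a library rather than a command; check the dev package instead
dpkg -s libopenblas-dev >/dev/null 2>&1 && echo "OK: libopenblas-dev" || echo "MISSING: libopenblas-dev"
```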
We first install the software dependencies, then move on to compiling the HPL project. This means writing and checking a Makefile, after which the build can be started. Once the build produces an executable, we configure the HPL.dat file, whose job is to supply the run-time parameters for the program. Finally, we write a small run script and are ready to start the actual performance test. To make this easier for beginners, I have prepared a deployment script, deploy_hpl.sh, which batches all of these commands so you can get HPL running quickly. The script is heavily commented, so reading through it is a good way to understand the whole workflow.
#!/bin/bash
################################################################################
# HPL (High Performance Linpack) CPU version - one-click deployment script
# Target: Ubuntu 24.04, 9-core CPU, 22 GB RAM
# Date: 2025-12-18
################################################################################
set -e # Exit immediately on any error
echo "=========================================="
echo "Starting HPL CPU deployment"
echo "=========================================="
# ============================================================================
# Step 1: Install dependencies
# gcc/g++ compilers, make, Fortran compiler, download/unpack tools,
# OpenMPI (parallel computing), OpenBLAS (high-performance linear algebra),
# numactl (NUMA control)
# ============================================================================
echo ""
echo "[Step 1/5] Installing build and runtime dependencies..."
apt update
apt install -y build-essential make gfortran wget tar openmpi-bin libopenmpi-dev libopenblas-dev numactl
# ============================================================================
# Step 2: Download and unpack the HPL source
# ============================================================================
echo ""
echo "[Step 2/5] Downloading HPL 2.3 source..."
mkdir -p /data/coding
cd /data/coding
rm -rf hpl-2.3 hpl-2.3.tar.gz
wget -O hpl-2.3.tar.gz https://www.netlib.org/benchmark/hpl/hpl-2.3.tar.gz
tar -xzf hpl-2.3.tar.gz
cd hpl-2.3
# ============================================================================
# Step 3: Create the Makefile configuration
# Key fix:
# Earlier pitfall: the CCNOOPT, ARCHIVER, ARFLAGS and RANLIB variables were
# missing, so HPL_dlamch.c failed to compile because hpl.h could not be found
# Fix: define the full set of variables, in particular CCNOOPT = $(HPL_DEFS)
# ============================================================================
echo ""
echo "[Step 3/5] Creating the build configuration file..."
cat > Make.Linux_OpenMPI_OpenBLAS << 'EOF'
# ============================================================================
# HPL Makefile configuration
# Arch: Linux + OpenMPI + OpenBLAS
# ============================================================================
SHELL = /bin/sh
CD = cd
CP = cp
LN_S = ln -s
MKDIR = mkdir
RM = /bin/rm -f
TOUCH = touch
# Architecture identifier
ARCH = Linux_OpenMPI_OpenBLAS
# Directory layout
TOPdir = /data/coding/hpl-2.3
INCdir = $(TOPdir)/include
BINdir = $(TOPdir)/bin/$(ARCH)
LIBdir = $(TOPdir)/lib/$(ARCH)
HPLlib = $(LIBdir)/libhpl.a
# Compiler configuration
CC = mpicc
LINKER = mpicc
ARCHIVER = ar
ARFLAGS = r
RANLIB = echo
# Compile options
HPL_OPTS = -DHPL_CALL_CBLAS
HPL_DEFS = $(HPL_OPTS) -I$(INCdir)
# Optimized compile flags
CCFLAGS = -O3 -march=native -fomit-frame-pointer -funroll-loops $(HPL_DEFS)
# Unoptimized compile flags (critical! HPL_dlamch.c needs these; this was the earlier pitfall)
CCNOOPT = $(HPL_DEFS)
# Link flags
LINKFLAGS = $(CCFLAGS)
# BLAS library configuration (OpenBLAS)
LAdir =
LAinc =
LAlib = -lopenblas -lpthread -lm -lgfortran
# HPL library configuration
HPL_INCLUDES = -I$(INCdir) $(LAinc)
HPL_LIBS = $(HPLlib) $(LAlib)
EOF
echo "配置文件创建完成,已修复 CCNOOPT 缺失问题"
# ============================================================================
# 第四步:编译 HPL
# ============================================================================
echo ""
echo "[步骤 4/5] 编译 HPL(需要几分钟)..."
make arch=Linux_OpenMPI_OpenBLAS
# Check the build result
if [ -f "bin/Linux_OpenMPI_OpenBLAS/xhpl" ]; then
echo "✓ 编译成功!可执行文件: bin/Linux_OpenMPI_OpenBLAS/xhpl"
else
echo "✗ 编译失败!"
exit 1
fi
# ============================================================================
# Step 5: Create the HPL.dat configuration file (with detailed annotations)
# ============================================================================
echo ""
echo "[Step 5/5] Creating HPL.dat..."
cat > /data/coding/hpl-2.3/bin/Linux_OpenMPI_OpenBLAS/HPL.dat << 'EOF'
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
1 # [Problem size] =========================================== # of problems sizes (N)
40000 Ns - matrix size N (N×N matrix) # memory use ≈ 8*N^2 bytes (double precision) # N=40000 uses about 12.8 GB, suitable for a 22 GB system # tuning tip: use 60-80% of total memory, formula: N ≈ sqrt(memory_GB * 0.7 * 10^9 / 8)
1 # [Block size] ============================================= # of NBs - number of block sizes
192 NBs - blocking factor # affects cache utilization and communication granularity # typical values: 128/192/256; the optimum has to be found experimentally
0 # [Process grid] =========================================== PMAP process mapping (0=Row-major, 1=Column-major) # 0=row-major mapping (recommended)
1 # of process grids (P x Q)
3 Ps - process rows # total processes = P × Q = 9 (matches the 9-core CPU)
3 Qs - process columns # tip: keep P and Q as close as possible (a square grid performs better)
16.0 # [Numerical check] ====================================== threshold - residual threshold # verifies correctness; 16.0 is usually fine
1 # [Factorization algorithms] =============================== # of panel fact - number of panel factorization variants
2 PFACTs (0=left, 1=Crout, 2=Right) # 0=left-looking, 1=Crout, 2=right-looking (recommended)
1 # of recursive stopping criterium - number of recursion stopping criteria
4 NBMINs - minimum recursive block size
1 # of panels in recursion
2 NDIVs - recursion divisor
1 # of recursive panel fact. - number of recursive panel factorization variants
2 RFACTs (0=left, 1=Crout, 2=Right)
1 # [Communication] =========================================== # of broadcast - number of broadcast algorithms
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM) # 1=1-ring Modified (recommended)
1 # of lookahead depth - number of lookahead depths
1 DEPTHs - lookahead depth (affects communication/computation overlap)
2 SWAP (0=bin-exch,1=long,2=mix) # 2=mixed swapping (recommended)
64 swapping threshold
0 # [Matrix storage] ========================================== L1 in (0=transposed,1=no-transposed) form # L stored in transposed form (recommended)
0 U in (0=transposed,1=no-transposed) form # U stored in transposed form (recommended)
1 # [Other] =================================================== Equilibration (0=no,1=yes) # 1=yes (recommended)
8 memory alignment in double (> 0) # alignment in double-precision words: 8 (recommended)
EOF
echo "HPL.dat 配置文件创建完成(已添加详细注释)"
# ============================================================================
# 创建运行脚本
# ============================================================================
echo ""
echo "创建快速运行脚本..."
cat > /data/coding/run_hpl.sh << 'EOF'
#!/bin/bash
################################################################################
# HPL quick-run script
################################################################################
cd /data/coding/hpl-2.3/bin/Linux_OpenMPI_OpenBLAS
# Set environment variables (prevent OpenBLAS from spawning its own threads)
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
echo "开始运行 HPL 测试..."
echo "结果将保存到: /data/coding/hpl_output.txt"
echo ""
# 运行HPL(9进程,对应9核)
mpirun --allow-run-as-root -np 9 ./xhpl | tee /data/coding/hpl_output.txt
echo ""
echo "测试完成!结果已保存到 /data/coding/hpl_output.txt"
EOF
chmod +x /data/coding/run_hpl.sh
# ============================================================================
# Deployment finished
# ============================================================================
echo ""
echo "=========================================="
echo "✓ HPL deployment complete!"
echo "=========================================="
echo ""
echo "Usage:"
echo " 1. Run the benchmark: bash /data/coding/run_hpl.sh"
echo " 2. View the results: cat /data/coding/hpl_output.txt"
echo ""
echo "File locations:"
echo " - HPL.dat: /data/coding/hpl-2.3/bin/Linux_OpenMPI_OpenBLAS/HPL.dat"
echo " - Executable: /data/coding/hpl-2.3/bin/Linux_OpenMPI_OpenBLAS/xhpl"
echo ""
echo "Tuning tips:"
echo " - Change N in HPL.dat to adjust the problem size (currently 40000)"
echo " - Change NB to adjust the block size (currently 192; try 128/256)"
echo " - Keep P×Q = 9 (to match your 9 cores)"
echo "=========================================="
On my system I saved this script as /data/coding/deploy_hpl.sh. First give the file execute permission:
chmod +x /data/coding/deploy_hpl.sh
Then run it:
bash /data/coding/deploy_hpl.sh
If everything goes smoothly you will see the build succeed. A few warnings may show up; that is fine, and we can move on to the test. As its last step, deploy_hpl.sh generates another script, /data/coding/run_hpl.sh, which batches the commands for the actual HPL run. Without it you would have to type fairly long commands each time; it is easier to edit them once in the script and simply execute it. Now run that script:
bash /data/coding/run_hpl.sh
With these steps you are now running HPL on the CPU! After the command finishes a few minutes later, you should see the benchmark output in /data/coding/hpl_output.txt:
================================================================================
HPLinpack 2.3 -- High-Performance Linpack benchmark -- December 2, 2018
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 40000
NB : 192
PMAP : Row-major process mapping
P : 3
Q : 3
PFACT : Right
NBMIN : 4
NDIV : 2
RFACT : Right
BCAST : 1ringM
DEPTH : 1
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11R2R4 40000 192 3 3 146.58 2.9111e+02
HPL_pdgesv() start time Thu Dec 18 16:42:25 2025
HPL_pdgesv() end time Thu Dec 18 16:44:52 2025
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 1.88812518e-03 ...... PASSED
================================================================================
Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.
================================================================================
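As a quick sanity check, we can plug the WR11R2R4 line above into the operation-count formula from the HPL section:

$$\frac{\tfrac{2}{3}\times 40000^3}{146.58 \times 10^9} \approx 291\ \text{Gflops},$$

which agrees with the reported 2.9111e+02 Gflops.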
That is one successful HPL run! Next you can try changing some parameters, for example in HPL.dat or in the mpirun command inside the run script. There are a few constraints to respect, for instance the product of $P$ and $Q$ in HPL.dat must equal the number of processes passed to mpirun with -np. Don't be afraid of errors: you already have one working run to compare against, and asking an AI assistant usually turns up a fix quickly.
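As a starting point for that tuning, here is a small helper sketch. It is not part of HPL; the 70% memory factor simply reuses the formula from the HPL.dat annotations above, and the P×Q listing enumerates grids that match a given process count:

```bash
#!/bin/bash
# Hypothetical helper: suggest N and list P x Q grids for HPL tuning.
MEM_GB=${1:-22}   # total memory in GB
NP=${2:-9}        # number of MPI processes (must equal P*Q)
NB=192            # block size; N is rounded down to a multiple of NB

# N ≈ sqrt(memory_GB * 0.7 * 1e9 / 8), rounded down to a multiple of NB
N=$(awk -v m="$MEM_GB" -v nb="$NB" 'BEGIN{n=int(sqrt(m*0.7*1e9/8)); print n-(n%nb)}')
echo "Suggested N for ${MEM_GB} GB (70% of memory): $N"

# All P x Q factorizations of NP; the closer to square, the better
for ((p=1; p<=NP; p++)); do
  (( NP % p == 0 )) && echo "P=$p  Q=$((NP/p))"
done
```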
GPU Version
HPCG
CPU Version
Compared with the CPU version of HPL, the CPU version of HPCG goes much more smoothly and avoids the Makefile pitfalls above. The core tools are as follows (a quick MPI sanity check follows the table):
| Tool | Purpose |
|---|---|
| openmpi-bin | OpenMPI executables; provides mpirun, mpicc, mpicxx and other parallel-computing commands. |
| openmpi-common | OpenMPI common files and configuration, including shared libraries and the runtime environment. |
| libopenmpi-dev | OpenMPI development package; MPI headers and static libraries for compiling MPI programs. |
| gcc / g++ | C/C++ compilers, pulled in automatically by libopenmpi-dev; compile the HPCG source. |
| gfortran | Fortran compiler, installed automatically as a dependency; HPCG does not use it directly but some libraries need it. |
| make | Build automation tool; coordinates the build according to the Makefile (usually preinstalled). |
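Since HPCG really only needs a working MPI stack on top of the compilers, a one-line check that mpirun can actually launch processes may save some debugging later. This is optional and not part of the HPCG workflow; it reuses the same mpirun flags as the run command below:

```bash
# Should print the hostname four times, once per MPI rank
mpirun --allow-run-as-root -np 4 hostname
```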
As before, here is the bash script; you can see it really is much simpler than the HPL one. I saved it as /data/coding/deploy_hpcg.sh.
# HPCG deployment script - from installation to configuration
# Step 1: Install dependencies
apt-get update
apt-get install -y openmpi-bin openmpi-common libopenmpi-dev
# Step 2: Download the source
cd /data/coding
wget https://www.hpcg-benchmark.org/downloads/hpcg-3.1.tar.gz
tar -xzf hpcg-3.1.tar.gz
cd hpcg-3.1
# Step 3: Configure the build environment
cp setup/Make.Linux_MPI .
# Step 4: Compile
mkdir -p bin
make arch=Linux_MPI
# Step 5: Configure the test parameters
cd bin
cat > hpcg.dat << HPCGDAT
HPCG benchmark input file
Sandia National Laboratories; University of Tennessee, Knoxville
104 104 104
60
HPCGDAT
echo "HPCG deployment complete!"
echo "Run the benchmark with the following commands:"
echo "cd /data/coding/hpcg-3.1/bin"
echo "mpirun --allow-run-as-root -np 4 ./xhpcg"
Give the script execute permission and run it:
chmod +x /data/coding/deploy_hpcg.sh
bash /data/coding/deploy_hpcg.sh
Once HPCG is deployed it is ready to run. The parameters the script writes into hpcg.dat are:
HPCG benchmark input file
Sandia National Laboratories; University of Tennessee, Knoxville
104 104 104
60
The first three numbers are the problem dimensions (the local grid size on each MPI process), and the last number is the minimum run time in seconds. An official run is required to last at least half an hour, i.e. 1800 seconds, but since we only want to get the benchmark working we can set it to 60.
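For reference, an official-style run would differ here only in the last line (and, for a meaningful result, a grid large enough that the data does not fit in cache). A minimal sketch of such a hpcg.dat:

```
HPCG benchmark input file
Sandia National Laboratories; University of Tennessee, Knoxville
104 104 104
1800
```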
The next step is simply to run HPCG. You can run it as a single process:
./xhpcg
or in parallel with multiple MPI processes:
mpirun --allow-run-as-root -np 4 ./xhpcg
Afterwards, in HPCG's bin directory, you will find two text files, HPCG-Benchmark_3.1_<date>_<time>.txt and hpcg<date>T<time>.txt. They contain the benchmark results:
HPCG-Benchmark_3.1_<date>_<time>.txt:
HPCG-Benchmark
version=3.1
Release date=March 28, 2019
Machine Summary=
Machine Summary::Distributed Processes=1
Machine Summary::Threads per processes=1
Global Problem Dimensions=
Global Problem Dimensions::Global nx=104
Global Problem Dimensions::Global ny=104
Global Problem Dimensions::Global nz=104
Processor Dimensions=
Processor Dimensions::npx=1
Processor Dimensions::npy=1
Processor Dimensions::npz=1
Local Domain Dimensions=
Local Domain Dimensions::nx=104
Local Domain Dimensions::ny=104
Local Domain Dimensions::Lower ipz=0
Local Domain Dimensions::Upper ipz=0
Local Domain Dimensions::nz=104
########## Problem Summary ##########=
Setup Information=
Setup Information::Setup Time=5.78039
Linear System Information=
Linear System Information::Number of Equations=1124864
Linear System Information::Number of Nonzero Terms=29791000
Multigrid Information=
Multigrid Information::Number of coarse grid levels=3
Multigrid Information::Coarse Grids=
Multigrid Information::Coarse Grids::Grid Level=1
Multigrid Information::Coarse Grids::Number of Equations=140608
Multigrid Information::Coarse Grids::Number of Nonzero Terms=3652264
Multigrid Information::Coarse Grids::Number of Presmoother Steps=1
Multigrid Information::Coarse Grids::Number of Postsmoother Steps=1
Multigrid Information::Coarse Grids::Grid Level=2
Multigrid Information::Coarse Grids::Number of Equations=17576
Multigrid Information::Coarse Grids::Number of Nonzero Terms=438976
Multigrid Information::Coarse Grids::Number of Presmoother Steps=1
Multigrid Information::Coarse Grids::Number of Postsmoother Steps=1
Multigrid Information::Coarse Grids::Grid Level=3
Multigrid Information::Coarse Grids::Number of Equations=2197
Multigrid Information::Coarse Grids::Number of Nonzero Terms=50653
Multigrid Information::Coarse Grids::Number of Presmoother Steps=1
Multigrid Information::Coarse Grids::Number of Postsmoother Steps=1
########## Memory Use Summary ##########=
Memory Use Information=
Memory Use Information::Total memory used for data (Gbytes)=0.80393
Memory Use Information::Memory used for OptimizeProblem data (Gbytes)=0
Memory Use Information::Bytes per equation (Total memory / Number of Equations)=714.691
Memory Use Information::Memory used for linear system and CG (Gbytes)=0.70754
Memory Use Information::Coarse Grids=
Memory Use Information::Coarse Grids::Grid Level=1
Memory Use Information::Coarse Grids::Memory used=0.0845059
Memory Use Information::Coarse Grids::Grid Level=2
Memory Use Information::Coarse Grids::Memory used=0.0105637
Memory Use Information::Coarse Grids::Grid Level=3
Memory Use Information::Coarse Grids::Memory used=0.0013209
########## V&V Testing Summary ##########=
Spectral Convergence Tests=
Spectral Convergence Tests::Result=PASSED
Spectral Convergence Tests::Unpreconditioned=
Spectral Convergence Tests::Unpreconditioned::Maximum iteration count=11
Spectral Convergence Tests::Unpreconditioned::Expected iteration count=12
Spectral Convergence Tests::Preconditioned=
Spectral Convergence Tests::Preconditioned::Maximum iteration count=1
Spectral Convergence Tests::Preconditioned::Expected iteration count=2
Departure from Symmetry |x'Ay-y'Ax|/(2*||x||*||A||*||y||)/epsilon=
Departure from Symmetry |x'Ay-y'Ax|/(2*||x||*||A||*||y||)/epsilon::Result=PASSED
Departure from Symmetry |x'Ay-y'Ax|/(2*||x||*||A||*||y||)/epsilon::Departure for SpMV=2.34166e-08
Departure from Symmetry |x'Ay-y'Ax|/(2*||x||*||A||*||y||)/epsilon::Departure for MG=2.70754e-08
########## Iterations Summary ##########=
Iteration Count Information=
Iteration Count Information::Result=PASSED
Iteration Count Information::Reference CG iterations per set=50
Iteration Count Information::Optimized CG iterations per set=50
Iteration Count Information::Total number of reference iterations=250
Iteration Count Information::Total number of optimized iterations=250
########## Reproducibility Summary ##########=
Reproducibility Information=
Reproducibility Information::Result=PASSED
Reproducibility Information::Scaled residual mean=4.99963e-08
Reproducibility Information::Scaled residual variance=0
########## Performance Summary (times in sec) ##########=
Benchmark Time Summary=
Benchmark Time Summary::Optimization phase=2.27e-07
Benchmark Time Summary::DDOT=1.05785
Benchmark Time Summary::WAXPBY=1.10945
Benchmark Time Summary::SpMV=9.14641
Benchmark Time Summary::MG=57.0875
Benchmark Time Summary::Total=68.4067
Floating Point Operations Summary=
Floating Point Operations Summary::Raw DDOT=1.69854e+09
Floating Point Operations Summary::Raw WAXPBY=1.69854e+09
Floating Point Operations Summary::Raw SpMV=1.51934e+10
Floating Point Operations Summary::Raw MG=8.47563e+10
Floating Point Operations Summary::Total=1.03347e+11
Floating Point Operations Summary::Total with convergence overhead=1.03347e+11
GB/s Summary=
GB/s Summary::Raw Read B/W=9.31008
GB/s Summary::Raw Write B/W=2.15166
GB/s Summary::Raw Total B/W=11.4617
GB/s Summary::Total with convergence and optimization phase overhead=10.9971
GFLOP/s Summary=
GFLOP/s Summary::Raw DDOT=1.60566
GFLOP/s Summary::Raw WAXPBY=1.53098
GFLOP/s Summary::Raw SpMV=1.66113
GFLOP/s Summary::Raw MG=1.48467
GFLOP/s Summary::Raw Total=1.51077
GFLOP/s Summary::Total with convergence overhead=1.51077
GFLOP/s Summary::Total with convergence and optimization phase overhead=1.44953
User Optimization Overheads=
User Optimization Overheads::Optimization phase time (sec)=2.27e-07
User Optimization Overheads::Optimization phase time vs reference SpMV+MG time=8.16786e-07
DDOT Timing Variations=
DDOT Timing Variations::Min DDOT MPI_Allreduce time=0.000962959
DDOT Timing Variations::Max DDOT MPI_Allreduce time=0.000962959
DDOT Timing Variations::Avg DDOT MPI_Allreduce time=0.000962959
Final Summary=
Final Summary::HPCG result is VALID with a GFLOP/s rating of=1.44953
Final Summary::HPCG 2.4 rating for historical reasons is=1.51077
Final Summary::Reference version of ComputeDotProduct used=Performance results are most likely suboptimal
Final Summary::Reference version of ComputeSPMV used=Performance results are most likely suboptimal
Final Summary::Reference version of ComputeMG used=Performance results are most likely suboptimal
Final Summary::Reference version of ComputeWAXPBY used=Performance results are most likely suboptimal
Final Summary::Results are valid but execution time (sec) is=68.4067
Final Summary::Official results execution time (sec) must be at least=1800
hpcg<date>T<time>.txt:
WARNING: PERFORMING UNPRECONDITIONED ITERATIONS
Call [0] Number of Iterations [11] Scaled Residual [8.0694e-13]
WARNING: PERFORMING UNPRECONDITIONED ITERATIONS
Call [1] Number of Iterations [11] Scaled Residual [8.0694e-13]
Call [0] Number of Iterations [1] Scaled Residual [1.96032e-16]
Call [1] Number of Iterations [1] Scaled Residual [1.96032e-16]
Departure from symmetry (scaled) for SpMV abs(x'*A*y - y'*A*x) = 2.34166e-08
Departure from symmetry (scaled) for MG abs(x'*Minv*y - y'*Minv*x) = 2.70754e-08
SpMV call [0] Residual [0]
SpMV call [1] Residual [0]
Call [0] Scaled Residual [4.99963e-08]
Call [1] Scaled Residual [4.99963e-08]
Call [2] Scaled Residual [4.99963e-08]
Call [3] Scaled Residual [4.99963e-08]
Call [4] Scaled Residual [4.99963e-08]
At this point you have the CPU version of HPCG running.
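If you just want the headline figure without scrolling through the whole report, a quick grep over the most recent report file works. This is an optional convenience, not part of HPCG itself; the file-name pattern is the one produced by the run above:

```bash
cd /data/coding/hpcg-3.1/bin
# Print the VALID rating and the execution-time lines from the newest report
grep -E "GFLOP/s rating|execution time" "$(ls -t HPCG-Benchmark_3.1_*.txt | head -n 1)"
```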
GPU Version
That concludes this walkthrough of running HPL and HPCG on CPU and GPU. These days, even if you know very little to begin with, you can get both benchmarks running with the help of an AI assistant, as long as you keep thinking, keep asking questions, and keep reading; combined with the power of AI, a great deal can be learned this way. If you use the same kind of machine as in this article, everything should run directly; on other machines and systems, this article still provides a quick worked example of a successful run. In short, keep the conversation with the AI going and this is not difficult. I hope this article has been helpful, and I wish you a smooth run and a good start on the optimization stage!
