Phase 3 System
Important
On Tuesday June 4th 2024 at 9am, Phase 3 will be switched off and moved to Bristol to become part of Isambard 3, which will be launched over the summer.
Isambard Phase 3, just like the MACS, hosts many nodes of different architectures:
1x login node with an AMD EPYC 7713 64-core “Milan” CPU
2x Nvidia GPU nodes, each with 4x Nvidia A100 40GB SXM “Ampere” GPUs and an AMD EPYC 7543P 32-core “Milan” CPU
2x AMD GPU nodes, each with 4x AMD Instinct “MI100” GPUs and an AMD EPYC 7543P 32-core “Milan” CPU
12x dual-socket AMD EPYC 7713 64-core “Milan” nodes @ 2.0 GHz, with 256 GB per node (16x 16GB DDR4-3200)
Phase 3 nodes run Red Hat Enterprise Linux 8 with the Cray software stack.
All nodes are connected via Slingshot 10. The login nodes are connected to the Internet via a 10 Gigabit link to the Janet Network.
Lustre Storage
Due to the migration to newer storage, old data can be found read-only at /lustreOld, e.g. your previous home directory is under /lustreOld/home
We have already copied over software from /lustreOld/software/x86 to the new /lustre location.
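If you still need other data from the old file system, you can copy it across yourself. A minimal sketch, where old_results is a placeholder for one of your own directories:
# Copy a directory from the read-only old Lustre into your new home directory
rsync -a /lustreOld/home/$USER/old_results/ $HOME/old_results/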
Nvidia GPU
There is an Nvidia HPC SDK installation on each of the ampereq nodes in /opt/nvidia, but you can also load a more recent version using:
module use /software/x86/tools/nvidia/hpc_sdk/modulefiles
module load nvhpc/22.9
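Once the module is loaded, the Nvidia compilers (nvc, nvc++, nvfortran and nvcc) are on your PATH. A minimal sketch of a GPU build, where saxpy.c is a placeholder for your own OpenACC source file:
# Target the A100s (compute capability 8.0) and report what was offloaded
nvc -acc -gpu=cc80 -Minfo=accel saxpy.c -o saxpy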
To submit an interactive job with 1 GPU and 1 CPU use:
qsub -I -q ampereq -l select=1:ncpus=1:ngpus=1:mem=100G
This sets CUDA_VISIBLE_DEVICES to the UUID (unique ID) of the GPU that PBS has allocated to you, and requests 100GB of memory along with 1 CPU.
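Once the interactive session starts on the compute node, you can confirm what was allocated, for example:
echo $CUDA_VISIBLE_DEVICES   # UUID of the GPU PBS allocated to you
nvidia-smi -L                # lists all GPUs physically present in the node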
AMD GPU
There is an AMD ROCm installation on each of the instinctq nodes in /opt/rocm, but you can also use the Cray compiler:
module load craype-accel-amd-gfx908
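A sketch of both routes, assuming the accel module enables OpenMP target offload in the Cray compiler (offload_test.c and my_kernel.cpp are placeholder source files of your own):
# Cray compiler with OpenMP target offload for gfx908 (MI100)
cc -fopenmp offload_test.c -o offload_test
# Or HIP via the ROCm install
/opt/rocm/bin/hipcc --offload-arch=gfx908 my_kernel.cpp -o my_kernel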
To submit an interactive job with 1 CPU and exclusive access to a GPU node, use:
qsub -I -q instinctq -l select=1:ncpus=1 -l place=excl
For now PBS is not set up to support reserving AMD GPUs, so we ask users to take the node exclusively (with -l place=excl) and not to specify ngpus in the resource line.
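Once on the node you can check the GPUs with the ROCm tools, for example:
/opt/rocm/bin/rocm-smi   # lists the node's MI100 GPUs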
Cray Compiler
Compiling
Compiling can be performed on the p3-login node.
The default modules should provide the required environment.
cc test.c
To compile with MPI and OpenMP, the following can be used:
cc -h omp test.c
Running a job
For example, to run a job on a single node in the milanq queue:
qsub -q milanq -l select=1:ncpus=128:mpiprocs=128
To run on 2 nodes you could use:
qsub -q milanq -l select=2:ncpus=128:mpiprocs=128 -l place=scatter:excl
This will place each chunk on a different node; since hyperthreading is enabled, each node exposes 256 scheduling slots and PBS could otherwise place both 128-rank chunks on the same node.
Then use the Cray PALS module to launch the MPI job:
module load cray-pals
mpirun hostname
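Putting this together, a minimal batch-script sketch (the script name and ./my_app are placeholders for your own script and Cray-compiled binary):
#!/bin/bash
#PBS -q milanq
#PBS -l select=2:ncpus=128:mpiprocs=128
#PBS -l place=scatter:excl
cd $PBS_O_WORKDIR
module load cray-pals
mpirun ./my_app
Submit it with qsub from the login node.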
Intel OneAPI
Compiling
Compiling can be performed on the p3-login node.
Please load the IntelOneApi modules for both the compiler and MPI:
module load IntelOneApi/compiler
module load IntelOneApi/mpi
This will make the standard Intel tools available. Since these are AMD processors, we recommend following AMD's guidance, which suggests using:
icc -march=core-avx2
To compile with MPI, the following can be used:
mpiicc -march=core-avx2 -fopenmp
Running a job
The system can use Intel MPI and the related compilers (load the modules as above). For example, to run a job on a single node in the milanq queue:
qsub -q milanq -l select=1:ncpus=128:mpiprocs=128
To run on 2 nodes you could use:
qsub -q milanq -l select=2:ncpus=128:mpiprocs=128 -l place=scatter:excl
This will place each chunk on a different node; since hyperthreading is enabled, PBS could otherwise place both 128-rank chunks on the same node.
Then use ssh to launch the MPI job:
mpirun -launcher ssh hostname
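As with the Cray toolchain, a minimal batch-script sketch (./my_app is a placeholder for a binary built with mpiicc as above):
#!/bin/bash
#PBS -q milanq
#PBS -l select=2:ncpus=128:mpiprocs=128
#PBS -l place=scatter:excl
cd $PBS_O_WORKDIR
module load IntelOneApi/compiler
module load IntelOneApi/mpi
mpirun -launcher ssh ./my_app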
Apptainer/Singularity
To allow containers to run on the system, which is common for GPU applications, we have installed Apptainer (a fork of the original Singularity project).
Nvidia
For example, to download an Nvidia TensorFlow container, run on the login node:
apptainer pull docker://nvcr.io/nvidia/tensorflow:23.11-tf2-py3
To run this on an Nvidia GPU:
qsub -I -q ampereq -l select=1:ngpus=1:mem=64g
Then run on the compute node:
apptainer shell tensorflow_23.11-tf2-py3.sif
Due to an issue with the initialisation of these containers under Apptainer, re-run the init script inside the container:
source /etc/shinit_v2
Then run python3 with some TensorFlow code.
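For example, a quick check that TensorFlow inside the container can see the allocated GPU (exact output varies between TensorFlow versions):
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"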
AMD
The same can be done on the AMD GPUs in instinctq, downloading the container on the login node:
apptainer pull docker://rocm/tensorflow
Then run on the AMD GPU queue:
qsub -I -q instinctq -l select=1:mem=64g -l place=excl
Finally, run the container:
apptainer shell tensorflow_latest.sif
And then run python3 with some TensorFlow code, using the same GPU check as in the Nvidia example above.