| name | unc-hpc |
| description | Run computational jobs on UNC Chapel Hill's HPC clusters (Longleaf and Sycamore). Use when a user wants to: submit SLURM batch jobs, set up Python/conda environments for ML/deep learning, run PyTorch or TensorFlow training on GPUs, run Hugging Face model inference or fine-tuning (including Unsloth for fast QLoRA), use Jupyter notebooks on the cluster, transfer data, manage storage, monitor GPU utilization, write SLURM scripts, do hyperparameter sweeps, or anything involving UNC Research Computing infrastructure. Covers both Longleaf (high-throughput, broad GPU selection) and Sycamore (HPC, H100 GPUs). |
UNC HPC Clusters
Guide for running ML/computational jobs on UNC Chapel Hill's Longleaf and Sycamore clusters.
Cluster Selection
Determine which cluster fits the workload:
Longleaf (longleaf.unc.edu) — Choose when:
- Running single-node GPU jobs (most ML training)
- Need interactive Jupyter via Open OnDemand (ondemand.rc.unc.edu)
- Want broadest GPU selection (L40S 48GB, A100 40GB, V100 16GB, GTX 1080 8GB)
- Running many independent jobs (sweeps, preprocessing)
Sycamore (sycamore.unc.edu) — Choose when:
- Need H100 GPUs (80GB HBM3) for large model training
- Running multi-node distributed training (InfiniBand NDR)
- Need massive CPU parallelism (192 cores/node, 1.5TB RAM/node)
- Running tightly coupled MPI workloads
Core Workflow
All jobs on both clusters follow this pattern:
1. Connect
ssh <onyen>@longleaf.unc.edu
ssh <onyen>@sycamore.unc.edu
Longleaf also has web access: https://ondemand.rc.unc.edu (Jupyter, RStudio, shell, file browser). Sycamore is CLI-only.
2. Set Up Environment
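module purge
module load anaconda/2024.02
conda create --name ml python=3.12
conda activate ml
conda install -c pytorch -c nvidia pytorch torchvision torchaudio pytorch-cuda=12.1
# ipykernel must be installed in the environment before the Jupyter kernel can be registered
conda install ipykernel
python -m ipykernel install --user --name=ml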
3. Stage Data
Place training data on /work (high-throughput SSD storage, shared across both clusters):
rsync -avz ./data/ <onyen>@rc-dm.its.unc.edu:/work/users/<o>/<n>/<onyen>/data/
Use data mover nodes (rc-dm.its.unc.edu) for transfers, not login nodes. Use Globus for large transfers (>10 min).
4. Write and Submit SLURM Script
Minimal GPU job on Longleaf:
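#!/bin/bash
# Partition is one option from the Quick GPU Selection Guide; adjust to your workload.
#SBATCH -p l40-gpu
#SBATCH --qos=gpu_access
#SBATCH --gres=gpu:1
module purge
module load anaconda/2024.02
conda activate ml
python train.py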
Minimal GPU job on Sycamore:
#!/bin/bash
# Single-node H100 partition; no --qos is needed here.
#SBATCH -p h100_sn
#SBATCH --gpus=1
module purge
module load anaconda/2024.02
conda activate ml
python train.py
Key differences: Longleaf uses --gres=gpu:N + --qos=gpu_access; Sycamore uses --gpus=N (no qos needed for h100_sn).
Submit: sbatch train.sl
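For many independent runs, such as a hyperparameter sweep, one option is a SLURM job array, where each array task picks its own setting. The following is a minimal sketch, not a definitive recipe: it assumes the Longleaf GPU request flags from this guide, an illustrative grid of learning rates, and a hypothetical `--lr` flag on `train.py`. The helper name `lr_for_task` is also made up for illustration.

```shell
#!/bin/bash
# Partition and QOS follow the Longleaf GPU settings in this guide; adjust as needed.
#SBATCH -p l40-gpu
#SBATCH --qos=gpu_access
#SBATCH --gres=gpu:1
# One array task per learning rate in LRS (indices 0-3).
#SBATCH --array=0-3
# %A is the array job ID, %a the array task index.
#SBATCH -o sweep_%A_%a.out

# Illustrative grid; replace with your own values.
LRS="1e-2 1e-3 1e-4 1e-5"

# Print the learning rate for a zero-based array index.
lr_for_task() {
    echo "$LRS" | cut -d' ' -f "$(( $1 + 1 ))"
}

# Launch training only when running as a SLURM array task.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
    module purge
    module load anaconda/2024.02
    conda activate ml
    # --lr is a hypothetical flag; use whatever train.py actually accepts.
    python train.py --lr "$(lr_for_task "$SLURM_ARRAY_TASK_ID")"
fi
```

Submitting this file once with sbatch queues four independent single-GPU tasks, each writing to its own output file.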
5. Monitor
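squeue -u <onyen>    # your queued and running jobs
seff <jobid>         # CPU and memory efficiency of a completed job
ssh <nodename>       # log in to the compute node running your job
nvidia-smi           # snapshot of GPU utilization and memory
nvtop                # live GPU monitor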
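To judge utilization over a whole run rather than from a single snapshot, it can help to log nvidia-smi output to a file and summarize it afterwards. The sketch below assumes a log file named `gpu_util.csv` (a hypothetical name) written in nvidia-smi's CSV query mode; `mean_gpu_util` is an illustrative helper, not a cluster-provided tool.

```shell
# On the compute node, log GPU utilization (%) and memory used (MiB) every 10 s:
#   nvidia-smi --query-gpu=utilization.gpu,memory.used \
#              --format=csv,noheader,nounits -l 10 > gpu_util.csv

# Print the mean of the first CSV column (GPU utilization) of a log file.
mean_gpu_util() {
    awk -F', *' '{ sum += $1; n++ } END { if (n) printf "%.1f\n", sum / n }' "$1"
}
```

Running `mean_gpu_util gpu_util.csv` after the job gives a single average; a low value suggests the GPU is waiting on data loading or CPU work.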
Key Differences at a Glance
| | Longleaf | Sycamore |
|---|---|---|
| SSH host | longleaf.unc.edu | sycamore.unc.edu |
| Web portal | OnDemand (ondemand.rc.unc.edu) | None |
| Best GPUs | L40S (48GB), A100 (40GB) | H100 (80GB) |
| GPU request | --gres=gpu:N + --qos=gpu_access | --gpus=N |
| Min cores (default) | 1 | 48 (use -p small for <48) |
| Multi-node GPU | No | Yes (InfiniBand, h100_mn) |
| Max walltime | 11 days (6 for A100) | 5 days |
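Because the GPU request syntax differs between the two clusters, a small wrapper can keep submit commands consistent. This is only a sketch: `gpu_flags` is a hypothetical helper name, and the flags it prints are the ones listed in the table above.

```shell
# Print the GPU request flags for a cluster, following the table above.
# Usage: gpu_flags <longleaf|sycamore> [number of GPUs, default 1]
gpu_flags() {
    local cluster="$1"
    local ngpus="${2:-1}"
    case "$cluster" in
        longleaf) echo "--gres=gpu:${ngpus} --qos=gpu_access" ;;
        sycamore) echo "--gpus=${ngpus}" ;;
        *) echo "unknown cluster: $cluster" >&2; return 1 ;;
    esac
}
```

For example, `sbatch -p l40-gpu $(gpu_flags longleaf 1) train.sl` passes the flags on the command line instead of as #SBATCH lines.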
Storage (Shared Across Both Clusters)
| Path | Quota | Backed Up | Notes |
|---|---|---|---|
| /nas/longleaf/home/<onyen> | 50GB | Yes | Scripts, configs. Avoid heavy I/O. |
| /users/<o>/<n>/<onyen> | 10TB | No | Inactive datasets, capacity expansion |
| /work/users/<o>/<n>/<onyen> | 10TB | No | Active computation data. Use this for training. |
| /pine/scr/<o>/<n>/<onyen> | 30TB | No | Scratch. 36-day purge policy. |
| /proj/<labname> | 1TB+ | No | PI shared space (request via research@unc.edu) |
<o> and <n> are first and second chars of your ONYEN.
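The two-character prefix can be derived in a script instead of typed by hand. A minimal sketch, where `work_dir` is a hypothetical helper and `jdoe` a made-up ONYEN:

```shell
# Print the /work directory for an ONYEN: /work/users/<o>/<n>/<onyen>
work_dir() {
    local onyen="$1"
    local o n
    o=$(printf '%s' "$onyen" | cut -c1)   # first character of the ONYEN
    n=$(printf '%s' "$onyen" | cut -c2)   # second character of the ONYEN
    echo "/work/users/${o}/${n}/${onyen}"
}
```

With this definition, `work_dir jdoe` prints `/work/users/j/d/jdoe`.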
Quick GPU Selection Guide
| Task | Recommended GPU | Cluster | Partition |
|---|---|---|---|
| Quick test / small model | GTX 1080 (8GB) | Longleaf | gpu |
| Standard training (FP32) | L40S (48GB) | Longleaf | l40-gpu |
| Large model / need FP64 | A100 (40GB) | Longleaf | a100-gpu |
| Largest models / fastest | H100 (80GB) | Sycamore | h100_sn |
| Multi-node distributed | H100 x8 (640GB) | Sycamore | h100_mn |
| Interactive Jupyter w/ GPU | A100 MIG slice | Longleaf | OnDemand |
Common Pitfalls
- Never run jobs on login nodes: always use sbatch or srun
- Don't train from your home directory: use /work for data I/O
- Don't transfer data via login nodes: use rc-dm.its.unc.edu
- Start with 1 GPU: multi-GPU jobs often underutilize the hardware without a proper DDP setup
- Don't mix pip and conda extensively: install conda packages first, pip last
- Always run module purge at the top of SLURM scripts: this prevents environment conflicts
- venv won't work with OnDemand Jupyter: use conda for Jupyter kernels
- Scratch purges after 36 days: don't store important results on /pine/scr
- The Sycamore batch partition has a 48-core minimum: use -p small for smaller jobs
- HF models fill the home directory: set HF_HOME=/work/users/<o>/<n>/<onyen>/hf_cache to avoid hitting the 50GB home quota
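The Hugging Face cache redirect from the last item can be set up once per job script. The sketch below assumes the /work layout described in this guide; `use_work_hf_cache` is a hypothetical helper name, and its optional second argument (the storage root) exists only so the function can be tried outside the cluster.

```shell
# Point the Hugging Face cache at a directory under /work instead of home.
# Usage: use_work_hf_cache <onyen> [storage root, default /work/users]
use_work_hf_cache() {
    local onyen="$1"
    local root="${2:-/work/users}"
    local o n
    o=$(printf '%s' "$onyen" | cut -c1)   # first character of the ONYEN
    n=$(printf '%s' "$onyen" | cut -c2)   # second character of the ONYEN
    HF_HOME="${root}/${o}/${n}/${onyen}/hf_cache"
    # Create the cache directory if needed, then export for child processes.
    mkdir -p "$HF_HOME" && export HF_HOME
}
```

Calling `use_work_hf_cache <onyen>` before the python line in a SLURM script makes subsequent model downloads land under /work.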
Getting Help
- Email: research@unc.edu
- Docs: https://help.rc.unc.edu/
- Account requests: https://tdx.unc.edu/TDClient/33/Portal/Requests/ServiceDet?ID=45
- Quota check: https://service.rc.unc.edu/ or run quota