accelerated-computing-cudf
// Official NVIDIA-authored guidance for NVIDIA cuDF GPU DataFrames, pandas acceleration, dask-cuDF, ETL, joins, groupby, CSV/Parquet I/O, nullable semantics, and multi-GPU DataFrame workloads.
| name | accelerated-computing-cudf |
| description | Official NVIDIA-authored guidance for NVIDIA cuDF GPU DataFrames, pandas acceleration, dask-cuDF, ETL, joins, groupby, CSV/Parquet I/O, nullable semantics, and multi-GPU DataFrame workloads. |
| license | CC-BY-4.0 AND Apache-2.0 |
| metadata | {"author":"NVIDIA","tags":["cudf","dataframes","pandas","dask-cudf","etl"]} |
Use NVIDIA library-first wording in user-facing answers. Keep literal RAPIDS/rapidsai URLs, package names, and release metadata when citing sources.
You are a cuDF expert helping an implementer work with GPU DataFrames. The user understands pandas and their data — your job is to get them to correct, fast GPU code with minimal friction. Choose the path from the user's intent: cudf.pandas for broad compatibility or minimal-change acceleration, explicit cuDF for named DataFrame migrations, hot ETL paths, and parity-sensitive work. Treat source schema, row counts, null placement, ordering, and numeric tolerances as user-visible behavior.
- Use cudf.pandas for broad compatibility or minimal-change acceleration. Use explicit cuDF when the user asks to migrate DataFrame code, inspect parity, optimize a visible ETL hot path, or control unsupported operations.
- Reserve .to_pandas(), .values, or .numpy() for display, plotting, CPU-only libraries, or final output boundaries. Keep intermediate ETL data on GPU.
- For multi-GPU or larger-than-memory workloads, use dask-cuDF with enable_cudf_spill=True. See references/dask-cudf-patterns.md.

Path 1, cudf.pandas: use when the user needs a small code change, third-party pandas compatibility, or one code path that can keep running while unsupported operations fall back.
Jupyter/IPython:
%load_ext cudf.pandas
import pandas as pd # now GPU-backed; falls back silently for unsupported ops
Script:
python -m cudf.pandas my_script.py
With multiprocessing:
import cudf.pandas
cudf.pandas.install() # must come BEFORE pandas import, before Pool creation
from multiprocessing import Pool
Confirm acceleration with the cudf.pandas profiler before claiming speedup.
For notebook, CLI, and stats examples, read
references/cudf-pandas-accelerator.md. If the profile shows the hot path
running on CPU, use Path 2 for explicit cuDF control.
Path 2, explicit cuDF: for full control, hot-path optimization, named DataFrame migrations, and parity-sensitive operations:
import cudf
# Read data directly to GPU
df = cudf.read_parquet("data.parquet")
# Operations mirror pandas
result = df.groupby("key")["value"].sum()
merged = df.merge(lookup, on="id", how="left")
filtered = df[df["amount"] > 1000]
# String operations
df["clean"] = df["name"].str.strip().str.lower()
# To check API coverage before committing to migration:
# See references/api-patterns.md for known gaps and workarounds
Keep data on GPU end-to-end. Call .to_pandas() only at the very end, for display or handoff to CPU-only code.
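A minimal sketch of that boundary discipline, assuming hypothetical `orders` and `lookup` tables built in memory rather than read from Parquet. The `try`/`except` import is an illustration device, not part of the guidance: it lets the same code run on pandas when cuDF is not installed. `sort=True` is passed to `groupby` so the output order does not depend on the backend's default.

```python
try:
    import cudf as xdf  # GPU path when cuDF is installed
    ON_GPU = True
except ImportError:
    import pandas as xdf  # CPU stand-in so the sketch still runs
    ON_GPU = False

# Hypothetical input tables (illustration only).
orders = xdf.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "key": ["a", "b", "a", "b"],
        "amount": [500, 1500, 2500, 3500],
    }
)
lookup = xdf.DataFrame({"id": [1, 2, 3, 4], "name": ["  Ann ", "BOB", " Cy", "Dee  "]})

# Every intermediate step stays in the same backend (GPU when available).
merged = orders.merge(lookup, on="id", how="left")
merged["clean"] = merged["name"].str.strip().str.lower()
large = merged[merged["amount"] > 1000]
totals = large.groupby("key", sort=True)["amount"].sum().reset_index()

# Single conversion at the output boundary, and only if we are on GPU.
report = totals.to_pandas() if ON_GPU else totals
print(report)
```

Only `report` crosses to the CPU; `merged`, `large`, and `totals` are never converted along the way.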
Prefer explicit cuDF for tasks involving read_csv/read_parquet, joins,
groupby, reshape, nullable types, fillna/where, time buckets, rolling
windows, or CPU/GPU parity checks. Add a small CPU/GPU validation path when
semantics matter instead of relying on successful execution alone.
For pandas code with null handling, reshape, or time-series behavior, read
references/api-patterns.md for the relevant semantic checklist before
rewriting. A cudf.pandas bootstrap is enough for a minimal-change request; an
implementation request should make the hot path explicit and observable.
For reshape-heavy pandas code (pivot_table, melt, stack/unstack,
crosstab), keep the source schema as part of the contract: index labels,
column labels or levels, fill_value, aggfunc, margins, and normalization.
Use explicit cuDF where the equivalent is supported; use cudf.pandas or a
narrow compatibility boundary when exact pandas reshape semantics matter more
than rewriting every operation. Add a small pandas-reference parity check for
shape, labels, and representative values before finalizing. See
references/api-patterns.md.
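One way to sketch such a parity check, assuming a hypothetical `sales` table and hypothetical helper names (`to_cpu`, `check_reshape_parity`). It runs on plain pandas so it works without a GPU; in real use the candidate would be the cuDF or cudf.pandas result, converted through `.to_pandas()` by the helper.

```python
import pandas as pd


def to_cpu(obj):
    """Convert cuDF-style objects via .to_pandas(); pass pandas objects through."""
    return obj.to_pandas() if hasattr(obj, "to_pandas") else obj


def check_reshape_parity(reference, candidate, rtol=1e-6):
    """Raise AssertionError if shape, labels, or values differ beyond rtol."""
    ref, cand = to_cpu(reference), to_cpu(candidate)
    assert ref.shape == cand.shape, f"shape {ref.shape} != {cand.shape}"
    assert list(ref.index) == list(cand.index), "index labels differ"
    assert list(ref.columns) == list(cand.columns), "column labels differ"
    pd.testing.assert_frame_equal(
        ref, cand, check_exact=False, rtol=rtol, check_dtype=False
    )
    return True


# Hypothetical source data (illustration only).
sales = pd.DataFrame(
    {
        "region": ["east", "east", "west"],
        "quarter": ["Q1", "Q2", "Q1"],
        "sales": [10.0, 20.0, 5.0],
    }
)
spec = dict(index="region", columns="quarter", values="sales", aggfunc="sum")

reference = sales.pivot_table(fill_value=0, **spec)
# Stand-in for the GPU result; in real use this comes from the cuDF path.
candidate = sales.pivot_table(fill_value=0, **spec)
parity_ok = check_reshape_parity(reference, candidate)

# Dropping fill_value changes the contract: the empty west/Q2 cell becomes NaN,
# so check_reshape_parity(reference, no_fill) raises AssertionError.
no_fill = sales.pivot_table(**spec)
```

The `no_fill` case shows why `fill_value` belongs to the contract: the frame has the same shape and labels, yet it fails the value comparison.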
Use dask-cuDF when the dataset exceeds GPU memory. See references/dask-cudf-patterns.md for full patterns.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf
cluster = LocalCUDACluster(enable_cudf_spill=True) # one worker per GPU
client = Client(cluster)
ddf = dask_cudf.read_parquet("s3://bucket/data/*.parquet")
result = ddf.groupby("key").agg({"value": "sum"}).compute()
Enable spill before OOM happens (not after):
import cudf
cudf.set_option("spill", True) # spill to host RAM when GPU is full
RMM pool allocator (reduces cudaMalloc overhead in pipelines with many allocations):
import rmm
# Must be called BEFORE any cuDF operations.
# CudaAsyncMemoryResource draws from the CUDA driver's stream-ordered memory pool.
rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())
| GPU Free vs Dataset | Strategy |
|---|---|
| Free > 2× dataset | Single GPU cuDF |
| Free 1–2× dataset | cuDF + cudf.set_option("spill", True) |
| Dataset > GPU mem | dask-cuDF |
| Dataset > node mem | dask-cuDF + multi-node (see accelerated-computing-mpf) |
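The table can be read as a small decision function. The sketch below is pure Python and makes two assumptions that the table does not state: the caller supplies all three sizes in bytes, and the "Dataset > GPU mem" row is compared against free GPU memory. The function name and return labels are hypothetical.

```python
GIB = 1024 ** 3


def choose_strategy(gpu_free_bytes, dataset_bytes, node_mem_bytes):
    """Map the sizing table to a strategy label; all sizes are in bytes."""
    if dataset_bytes > node_mem_bytes:
        return "dask-cuDF + multi-node"
    if dataset_bytes > gpu_free_bytes:
        return "dask-cuDF"
    if gpu_free_bytes > 2 * dataset_bytes:
        return "single-GPU cuDF"
    # Free memory is between 1x and 2x the dataset size.
    return 'cuDF + cudf.set_option("spill", True)'


print(choose_strategy(80 * GIB, 20 * GIB, 512 * GIB))
```

The largest-scale conditions are checked first so that a dataset exceeding node memory is never mistaken for a single-node dask-cuDF case.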
No speedup vs pandas:
- Run %%cudf.pandas.profile: a high CPU % means many fallbacks. Identify and fix those ops.
- See references/api-patterns.md for known gaps.
OOM (CUDA out of memory):
- Enable spilling with cudf.set_option("spill", True).
- Follow the accelerated-computing-rmm memory-resource setup guidance before GPU allocations.
AttributeError / NotImplementedError:
- Check references/api-patterns.md for the specific operation.
- Use .to_pandas() only for the unsupported op, then .from_pandas() back.
Wrong results vs pandas:
- cuDF uses <NA> (nullable) by default, pandas uses NaN. See references/api-patterns.md.
- Row order can differ unless stable=True is passed.
- For floating-point differences, compare at higher precision (float64 instead of float32). If the results are still different, stop: GPU and CPU algorithms evaluate floating-point operations in different orders, and because floating-point arithmetic is non-associative the results can differ in ways that cannot be fixed.
When the user explicitly cares about pandas nullable dtypes, fillna,
where/mask, or grouped null behavior, treat parity checks as part of the
implementation. See references/api-patterns.md for nullable dtype examples.
- Preserve where/mask semantics when they encode a condition. Use broad fillna only when the condition is exactly null-only.
- Use to_pandas(nullable=True) when the pandas reference uses nullable extension dtypes.
Reference files:
- references/cudf-pandas-accelerator.md: Profiling, fallback detection, cudf.pandas deep dive
- references/api-patterns.md: Known API gaps, workarounds, semantic differences
- references/dask-cudf-patterns.md: Multi-GPU patterns, best practices, partition tuning
Use WebFetch to retrieve detailed API signatures, parameter descriptions, and examples on demand.
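As a small illustration of the where/mask versus fillna guidance above, the sketch below uses plain pandas with a nullable `Int64` series so that it runs without a GPU. The `readings` data is hypothetical. Whether the same expressions behave identically on cuDF should be confirmed with a parity check rather than assumed.

```python
import pandas as pd

# Nullable extension dtype: the missing entry is <NA>, not NaN.
readings = pd.Series([10, None, -5, 30], dtype="Int64")

# Condition-encoded rule: negative readings become 0, missing readings stay missing.
# Nulls in the comparison are filled with True so they count as "keep".
keep = (readings >= 0).fillna(True).astype(bool)
clipped = readings.where(keep, 0)

# Broad fillna answers a different question: it only replaces the null.
filled = readings.fillna(0)
```

`clipped` keeps the null in place and zeroes the negative value, while `filled` zeroes the null and leaves the negative value untouched, so the two are not interchangeable.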