| name | polars |
| description | High-performance data manipulation with lazy/eager execution, expressions, I/O (CSV, Parquet, JSON), aggregations, joins, string/datetime ops, and pandas interop. Covers performance optimization patterns and common anti-patterns. DAAF's default DataFrame library: all pipeline code uses Polars, not pandas. Use for any DataFrame operation, reading/writing Parquet files, or migrating existing pandas code to Polars. |
| metadata | {"audience":"research-coders","domain":"python-library","library-version":"1.x","skill-last-updated":"2026-03-26"} |
Polars Skill
Polars DataFrame library for high-performance data manipulation in Python. Covers lazy/eager execution, expressions, I/O (CSV, Parquet, JSON, database), aggregations, joins, string/datetime operations, pandas/NumPy interop, and performance optimization. Use when working with Polars DataFrames, migrating from pandas, reading Parquet files, or optimizing data pipeline performance.
Comprehensive skill for high-performance data manipulation with Polars. Use the decision trees below to find the right guidance, then load the detailed reference files.
What is Polars?
Polars is a fast DataFrame library for Python (and Rust):
- Fast: Written in Rust, optimized for modern CPUs with SIMD and parallelism
- Lazy Evaluation: Build query plans that get optimized before execution
- Expressive: Powerful expression API for complex transformations
- Memory Efficient: Columnar format, streaming for larger-than-memory data
- No Dependencies: Pure Rust core, no NumPy/Pandas required
Version Notes
This skill targets Polars 1.x (tested with 1.37.1). Key changes from 0.x:
- apply renamed to map_elements (0.19+)
- groupby renamed to group_by (0.19+)
- melt renamed to unpivot (1.0+)
- Streaming engine improvements in 1.x
- pl.Utf8 is now pl.String (1.0+, Utf8 still works as alias)
How to Use This Skill
Reference File Structure
Each topic in ./references/ contains focused documentation:
| File | Purpose | When to Read |
|---|---|---|
| quickstart.md | Installation, concepts, first DataFrame | Starting with Polars |
| dataframes-series.md | Creation, selection, filtering, modification | Basic data manipulation |
| io-data.md | CSV, Parquet, JSON, database I/O | Loading/saving data |
| expressions.md | Expression system, contexts, chaining | Understanding Polars idioms |
| aggregations-grouping.md | GroupBy, window functions, statistics | Summarizing data |
| joins-concat.md | Joins, concatenation, pivot/unpivot | Combining DataFrames |
| strings-datetime-categorical.md | String ops, datetime, categoricals | Type-specific operations |
| performance.md | Lazy execution, optimization, anti-patterns | Making code faster |
| interop.md | Pandas, NumPy, PyArrow, DuckDB | Working with other tools |
| gotchas.md | Common errors, anti-patterns, migration | Debugging issues |
Reading Order
- New to Polars? Start with quickstart.md, then expressions.md
- Coming from Pandas? Read quickstart.md, expressions.md, then interop.md
- Performance issues? Check performance.md first
Quick Decision Trees
"I need to get started"
Getting started?
├─ Install Polars → ./references/quickstart.md
├─ Create first DataFrame → ./references/quickstart.md
├─ Understand lazy vs eager → ./references/quickstart.md
├─ Learn expression syntax → ./references/expressions.md
└─ Coming from Pandas → ./references/interop.md
"I need to load or save data"
Loading/saving data?
├─ Read CSV file → ./references/io-data.md
├─ Read Parquet (recommended) → ./references/io-data.md
├─ Read JSON/NDJSON → ./references/io-data.md
├─ Read from database → ./references/io-data.md
├─ Read multiple files (glob) → ./references/io-data.md
├─ Write to file → ./references/io-data.md
└─ Larger-than-memory data → ./references/performance.md
"I need to filter or select data"
Filtering/selecting?
├─ Select columns by name → ./references/dataframes-series.md
├─ Select by pattern/regex → ./references/dataframes-series.md
├─ Select by data type → ./references/dataframes-series.md
├─ Filter rows by condition → ./references/dataframes-series.md
├─ Filter with multiple conditions → ./references/dataframes-series.md
├─ Handle null values → ./references/dataframes-series.md
└─ Add/modify columns → ./references/dataframes-series.md
"I need to aggregate or group data"
Aggregating data?
├─ Basic statistics (sum, mean, etc.) → ./references/aggregations-grouping.md
├─ Group by columns → ./references/aggregations-grouping.md
├─ Multiple aggregations → ./references/aggregations-grouping.md
├─ Window functions (over) → ./references/aggregations-grouping.md
├─ Rolling/moving averages → ./references/aggregations-grouping.md
├─ Cumulative operations → ./references/aggregations-grouping.md
└─ Ranking within groups → ./references/aggregations-grouping.md
"I need to combine DataFrames"
Combining data?
├─ Join two DataFrames → ./references/joins-concat.md
├─ Left/right/outer join → ./references/joins-concat.md
├─ Anti-join (not in) → ./references/joins-concat.md
├─ Concatenate vertically → ./references/joins-concat.md
├─ Pivot (long to wide) → ./references/joins-concat.md
└─ Unpivot/melt (wide to long) → ./references/joins-concat.md
"I need better performance"
Performance issues?
├─ Use lazy evaluation → ./references/performance.md
├─ Avoid row iteration → ./references/performance.md
├─ Reduce memory usage → ./references/performance.md
├─ Process large files → ./references/performance.md
├─ Optimize query plan → ./references/performance.md
└─ Common anti-patterns → ./references/performance.md
"Something isn't working"
Having issues?
├─ Type errors → ./references/gotchas.md
├─ Null handling → ./references/gotchas.md
├─ Expression context errors → ./references/gotchas.md
├─ String operations → ./references/strings-datetime-categorical.md
├─ Date parsing issues → ./references/strings-datetime-categorical.md
├─ Performance problems → ./references/gotchas.md
├─ Pandas migration issues → ./references/gotchas.md
├─ Memory errors → ./references/gotchas.md
└─ General troubleshooting → ./references/gotchas.md
File-First Execution in Research Workflows
Important: In data research pipelines (see CLAUDE.md), Polars transformations are executed through script files, not interactively. This ensures auditability and reproducibility.
The pattern:
- Write transformation code to scripts/stage{N}_{type}/{step}_{task-name}.py
- Execute it via Bash through the wrapper script, which captures output automatically
- Validation results are embedded automatically in the script as comments
- If the script fails, create a versioned copy for the fix
Closely read agent_reference/SCRIPT_EXECUTION_REFERENCE.md for the mandatory file-first execution protocol covering complete code file writing, output capture, and file versioning rules.
See: agent_reference/SCRIPT_EXECUTION_REFERENCE.md (script execution protocol and format, with validation)
The examples below show Polars syntax. In research workflows, wrap them in scripts following the file-first pattern.
Quick Reference
Essential Import
import polars as pl
import polars.selectors as cs
Lazy vs Eager (One-Liner)
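df = pl.read_csv("data.csv")  # eager: reads the file immediately into a DataFrame
lf = pl.scan_csv("data.csv")  # lazy: builds a query plan, nothing is read yet
df = lf.collect()  # executes the optimized plan and returns a DataFrame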
Core Expression Patterns
df.select("a", "b")  # select columns by name
df.select(pl.col("a"), pl.col("b"))  # same selection, written as expressions
df.select(pl.all().exclude("id"))  # every column except "id"
df.filter(pl.col("a") > 10)  # keep rows matching one condition
df.filter((pl.col("a") > 10) & (pl.col("b") == "x"))  # combine conditions with & and parentheses
df.with_columns(  # add derived columns
    (pl.col("a") * 2).alias("a_doubled"),
    pl.col("b").str.to_uppercase().alias("b_upper"),
)
df.with_columns(  # conditional column
    pl.when(pl.col("a") > 10)
    .then(pl.lit("high"))
    .otherwise(pl.lit("low"))
    .alias("category")
)
df.group_by("category").agg(  # several aggregations per group
    pl.col("value").sum().alias("total"),
    pl.col("value").mean().alias("average"),
    pl.len().alias("count"),
)
Essential Functions
| Function | Purpose |
|---|---|
| pl.col("name") | Reference a column |
| pl.lit(value) | Literal value |
| pl.all() | All columns |
| pl.exclude("col") | All except specified |
| pl.len() | Row count |
| pl.when().then().otherwise() | Conditional logic |
| .alias("name") | Rename result |
| .cast(pl.Int64) | Convert type |
Common Data Types
| Type | Description |
|---|---|
| pl.Int64, pl.Int32 | Integers |
| pl.Float64, pl.Float32 | Floats |
| pl.String (or pl.Utf8) | Strings |
| pl.Boolean | True/False |
| pl.Date, pl.Datetime | Dates and timestamps |
| pl.Duration | Time differences |
| pl.Categorical | Categorical strings |
| pl.List | List of values |
| pl.Struct | Named fields |
Quick Cheatsheet
df = pl.read_csv("file.csv")  # also pl.read_parquet, pl.read_json
lf = pl.scan_csv("file.csv")  # also pl.scan_parquet, pl.scan_ndjson
df.write_csv("out.csv")  # also df.write_parquet, df.write_json
df.select("a", "b")
df.select(cs.numeric())
df.filter(pl.col("a") > 1)
df.group_by("key").agg(pl.col("val").sum())
df1.join(df2, on="key", how="left")
df.sort("col", descending=True)
lf.collect()
lf.explain()
Topic Index
| Topic | Reference File |
|---|---|
| Installation | ./references/quickstart.md |
| DataFrame Creation | ./references/quickstart.md |
| Lazy vs Eager | ./references/quickstart.md |
| Column Selection | ./references/dataframes-series.md |
| Row Filtering | ./references/dataframes-series.md |
| Adding Columns | ./references/dataframes-series.md |
| CSV Files | ./references/io-data.md |
| Parquet Files | ./references/io-data.md |
| Database Connections | ./references/io-data.md |
| Expressions | ./references/expressions.md |
| Method Chaining | ./references/expressions.md |
| Contexts | ./references/expressions.md |
| GroupBy | ./references/aggregations-grouping.md |
| Window Functions | ./references/aggregations-grouping.md |
| Rolling Windows | ./references/aggregations-grouping.md |
| Joins | ./references/joins-concat.md |
| Concatenation | ./references/joins-concat.md |
| Pivot/Unpivot | ./references/joins-concat.md |
| String Operations | ./references/strings-datetime-categorical.md |
| Datetime Handling | ./references/strings-datetime-categorical.md |
| Categorical Data | ./references/strings-datetime-categorical.md |
| Query Optimization | ./references/performance.md |
| Memory Management | ./references/performance.md |
| Anti-Patterns | ./references/performance.md |
| Pandas Conversion | ./references/interop.md |
| NumPy Integration | ./references/interop.md |
| DuckDB Integration | ./references/interop.md |
| Type Errors | ./references/gotchas.md |
| qcut Label Gotcha | ./references/gotchas.md |
| Null Handling Issues | ./references/gotchas.md |
| Expression Context Errors | ./references/gotchas.md |
| Performance Anti-Patterns | ./references/gotchas.md |
| Migration from Pandas | ./references/gotchas.md |
| Memory Issues | ./references/gotchas.md |
Citation
When this library is used as a primary analytical tool, include in the report's
Software & Tools references:
Vink, R. et al. Polars: Blazingly fast DataFrames [Computer software]. https://pola.rs/
Cite when: Polars is the core data processing engine for the analysis (typically always true in DAAF pipelines).
Do not cite when: Only used for trivial file I/O in a script primarily using another tool.