| name | polars |
| description | Blazingly fast DataFrame library written in Rust. Features a multi-threaded query engine, lazy evaluation, and efficient memory usage via Apache Arrow. Designed for high-performance data processing on a single machine. Use for large datasets (1GB-100GB+), fast data transformations, Parquet/CSV processing, complex query pipelines, memory-efficient operations, and when speed is critical (10-100x faster than pandas). |
| version | 0.2 |
| license | MIT |
Polars - High-Performance Dataframes
Polars is designed for speed. Unlike pandas, which processes data sequentially on a single CPU core, Polars parallelizes operations across all available cores. Its "Lazy API" allows it to optimize queries before execution, significantly reducing memory overhead and processing time.
When to Use
- Processing large datasets (1GB - 100GB+) that struggle in pandas.
- When execution speed is a priority (Polars is often 10-100x faster than pandas).
- Working with complex data transformation pipelines (Lazy evaluation).
- Systems with limited RAM (Polars is more memory-efficient than pandas).
- Situations requiring strict type safety and consistent null handling.
- Reading/writing large Parquet, CSV, or Avro files.
Reference Documentation
Official docs: https://docs.pola.rs/
User Guide: https://docs.pola.rs/user-guide/
Search patterns: pl.DataFrame, pl.LazyFrame, pl.col, df.select, df.filter, df.group_by
Core Principles
Eager vs. Lazy API
- Eager: Operations are executed immediately (like pandas).
- Lazy: Operations are queued into a query plan. Polars optimizes the plan (e.g., predicate pushdown, projection pushdown) and executes it only when called.
The Expression API
Polars uses a declarative syntax. Instead of writing loops or complex lambdas, you write expressions using pl.col(). These expressions are highly optimized and run in parallel.
Apache Arrow
Polars stores data in the Apache Arrow format, enabling zero-copy data exchange with other tools like PyArrow and DuckDB.
Quick Reference
Installation
pip install polars
pip install 'polars[all]'
Standard Imports
import polars as pl
import numpy as np
Basic Pattern - Lazy Workflow (The "Polars Way")
import polars as pl
lf = pl.scan_csv("massive_data.csv")
query = (
lf.filter(pl.col("age") > 25)
.group_by("city")
.agg([
pl.col("salary").mean().alias("avg_salary"),
pl.col("name").count().alias("count")
])
.sort("avg_salary", descending=True)
)
df = query.collect()
Critical Rules
✅ DO
- Prefer Lazy API (scan_*) - This allows Polars to optimize memory and skip unnecessary data.
- Use Expressions - Always use
pl.col("name") instead of selecting columns via strings or indices.
- Method Chaining - Polars is built for clean, readable pipelines.
- Specify Schema - When reading CSVs, providing a schema prevents type inference errors and speeds up loading.
- Use collect(streaming=True) - For datasets larger than RAM, streaming allows Polars to process data in chunks.
- Parquet over CSV - Use Parquet for permanent storage; it is significantly faster and stores type information.
❌ DON'T
- Avoid .apply() - Custom Python functions are slow because they break the Rust/parallel optimization. Use built-in expressions.
- Don't use inplace=True - Polars (like JAX) favors immutability; transformations return new DataFrames.
- Don't convert to pandas early - Keep data in Polars as long as possible to maintain speed.
- Avoid Row Iteration -
for row in df is an anti-pattern; use vectorized expressions.
Anti-Patterns (NEVER)
import polars as pl
df.select(pl.col("val") * 2)
(pl.scan_csv("data.csv")
.filter(pl.col("id") == 5)
.group_by("id").mean())
print(df.head())
Expression API Deep Dive
Selection and Transformation
df.select([
pl.col("name"),
pl.col("price") * 1.2,
pl.col("category").str.to_uppercase(),
pl.col("date").dt.year().alias("year")
])
Filtering
df.filter(
(pl.col("price") < 100) &
(pl.col("status") == "active") |
(pl.col("category").is_in(["A", "B"]))
)
Aggregation and Grouping
High-Performance Stats
results = df.group_by("department").agg([
pl.col("salary").sum(),
pl.col("salary").max().alias("max_pay"),
pl.col("name").n_unique().alias("unique_employees"),
pl.col("salary").filter(pl.col("role") == "manager").mean().alias("manager_avg")
])
Joins and Concatenation
SQL-like operations
df_joined = df_a.join(df_b, on="id", how="left")
df_aligned = df_trades.join_asof(df_quotes, on="timestamp", by="symbol")
df_stacked = pl.concat([df1, df2], how="vertical")
Reshaping (Pivot and Melt)
pivoted = df.pivot(values="sales", index="date", columns="region", aggregate_function="sum")
melted = df.melt(id_vars="date", value_vars=["store_a", "store_b"])
Practical Workflows
1. Large-Scale Data Cleaning Pipeline
def clean_and_optimize(path):
return (
pl.scan_parquet(path)
.drop_nulls(subset=["user_id"])
.with_columns([
pl.col("email").str.to_lowercase(),
pl.col("timestamp").str.to_datetime("%Y-%m-%d %H:%M:%S"),
(pl.col("income") / 1000).cast(pl.Float32)
])
.filter(pl.col("timestamp") > pl.date(2023, 1, 1))
.collect(streaming=True)
)
2. Time-Series Feature Engineering
def engineer_features(df):
return df.with_columns([
pl.col("price").rolling_mean(window_size="7d", by="date").alias("rolling_7d"),
pl.col("price").shift(1).alias("prev_price"),
pl.col("sales").cum_sum().over("category")
])
3. Fast JSON/Log Parsing
def parse_logs(path):
return (
pl.scan_ndjson(path)
.select([
"level",
pl.col("message").str.extract(r"Error: (.*)", 1),
pl.col("metadata").struct.field("user_id")
])
.collect()
)
Performance Optimization
The Power of with_columns
Instead of creating one column at a time, use with_columns to run multiple calculations in parallel.
df = df.with_columns([
(pl.col("a") + pl.col("b")).alias("sum"),
(pl.col("a") * pl.col("b")).alias("prod"),
pl.col("c").str.len().alias("c_len")
])
Column Selection via Dtypes
Rapidly apply transformations to groups of columns.
df = df.with_columns(
pl.col(pl.Float64) * 100
)
Common Pitfalls and Solutions
The .apply() Trap
Python functions in .map_elements() (formerly .apply()) are slow.
df.select(
pl.when(pl.col("score") > 50).then(pl.lit("Pass")).otherwise(pl.lit("Fail"))
)
Memory Errors on Large Files
If you hit OOM with .collect(), you might be trying to load too much data into memory.
String vs Categorical
For low-cardinality strings (like "City" or "Gender"), use Categorical.
df = df.with_columns(pl.col("category").cast(pl.Categorical))
Best Practices
- Always use Lazy API for large files - Start with
scan_csv() or scan_parquet() instead of read_csv() or read_parquet().
- Build complete query plans before collecting - Let Polars optimize the entire pipeline.
- Use expressions over Python functions - Leverage
pl.col() expressions for maximum performance.
- Specify schemas when reading CSVs - Prevents type inference overhead and errors.
- Use streaming for out-of-memory datasets - Enable
streaming=True in collect().
- Prefer Parquet format - Faster reads/writes and preserves type information.
- Cast to Categorical for low-cardinality strings - Significant memory and performance gains.
- Use with_columns for multiple transformations - Parallelizes column creation.
- Filter early in lazy queries - Predicate pushdown reduces data scanned.
- Avoid converting to pandas - Stay in Polars ecosystem for maximum speed.
Polars is the new gold standard for single-node data processing. By combining the safety of Rust with the flexibility of Python, it provides a seamless and incredibly fast experience for modern data science.