polars

Stars: 16

Forks: 2

Last updated: February 1, 2026 at 04:41

Blazingly fast DataFrame library written in Rust. Features a multi-threaded query engine, lazy evaluation, and efficient memory usage via Apache Arrow. Designed for high-performance data processing on a single machine. Use for large datasets (1GB-100GB+), fast data transformations, Parquet/CSV processing, complex query pipelines, memory-efficient operations, and when speed is critical (10-100x faster than pandas).

Source: tondevrel/scientific-agent-skills

name	polars
description	Blazingly fast DataFrame library written in Rust. Features a multi-threaded query engine, lazy evaluation, and efficient memory usage via Apache Arrow. Designed for high-performance data processing on a single machine. Use for large datasets (1GB-100GB+), fast data transformations, Parquet/CSV processing, complex query pipelines, memory-efficient operations, and when speed is critical (10-100x faster than pandas).
version	0.2
license	MIT

Polars - High-Performance Dataframes

Polars is designed for speed. Unlike pandas, which runs most operations on a single CPU core, Polars parallelizes operations across all available cores. Its Lazy API lets it optimize a query before execution, which can significantly reduce memory overhead and processing time.

When to Use

Processing large datasets (1GB - 100GB+) that struggle in pandas.
When execution speed is a priority (Polars is often cited as 10-100x faster than pandas, though the gain depends on the workload).
Working with complex data transformation pipelines (Lazy evaluation).
Systems with limited RAM (Polars is more memory-efficient than pandas).
Situations requiring strict type safety and consistent null handling.
Reading/writing large Parquet, CSV, or Avro files.

Reference Documentation

Official docs: https://docs.pola.rs/
User Guide: https://docs.pola.rs/user-guide/
Search patterns: pl.DataFrame, pl.LazyFrame, pl.col, df.select, df.filter, df.group_by

Core Principles

Eager vs. Lazy API

Eager: Operations are executed immediately (like pandas).
Lazy: Operations are queued into a query plan. Polars optimizes the plan (e.g., predicate pushdown, projection pushdown) and executes it only when .collect() is called.

The Expression API

Polars uses a declarative syntax. Instead of writing loops or complex lambdas, you write expressions using pl.col(). These expressions are highly optimized and run in parallel.

Apache Arrow

Polars stores data in the Apache Arrow columnar format, which allows cheap and often zero-copy data exchange with other tools such as PyArrow and DuckDB.

Quick Reference

Installation

pip install polars
# For Excel/Cloud support
pip install 'polars[all]'

Standard Imports

import polars as pl
import numpy as np

Basic Pattern - Lazy Workflow (The "Polars Way")

import polars as pl

# 1. Scan (Lazy) - doesn't load data yet
lf = pl.scan_csv("massive_data.csv")

# 2. Build Query Plan
query = (
    lf.filter(pl.col("age") > 25)
    .group_by("city")
    .agg([
        pl.col("salary").mean().alias("avg_salary"),
        pl.col("name").count().alias("count")
    ])
    .sort("avg_salary", descending=True)
)

# 3. Collect (Execute)
df = query.collect()

Critical Rules

✅ DO

Prefer Lazy API (scan_*) - This allows Polars to optimize memory and skip unnecessary data.
Use Expressions - Build transformations from pl.col("name") expressions rather than positional indexing or Python loops.
Method Chaining - Polars is built for clean, readable pipelines.
Specify Schema - When reading CSVs, providing a schema prevents type inference errors and speeds up loading.
Use the streaming engine - For datasets larger than RAM, collect(engine="streaming") (collect(streaming=True) on older releases) lets Polars process data in chunks.
Parquet over CSV - Use Parquet for permanent storage; it is significantly faster and stores type information.

❌ DON'T

Avoid .map_elements() (formerly .apply()) - Custom Python functions are slow because they run element by element in the Python interpreter, outside the parallel Rust engine. Use built-in expressions.
Don't expect inplace=True - Polars (like JAX) favors immutability; transformations return new DataFrames, so reassign the result.
Don't convert to pandas early - Keep data in Polars as long as possible to maintain speed.
Avoid Row Iteration - Looping with for row in df.iter_rows() is an anti-pattern; use vectorized expressions.

Anti-Patterns (NEVER)

import polars as pl

# Small illustrative frame so the snippets below run as written
df = pl.DataFrame({"id": [5, 5, 7], "val": [1.0, 2.0, 3.0]})
df.write_csv("data.csv")

# ❌ BAD: Using Python lambdas for simple math
# df.select(pl.col("val").map_elements(lambda x: x * 2)) # Slow!

# ✅ GOOD: Use expressions
df.select(pl.col("val") * 2) # Fast, parallelized in Rust

# ❌ BAD: Filtering after a heavy operation
# df.group_by("id").mean().filter(pl.col("id") == 5)

# ✅ GOOD: Lazy API will automatically "push down" the filter
(pl.scan_csv("data.csv")
 .filter(pl.col("id") == 5) # Rows with other ids are dropped during the scan
 .group_by("id").mean()
 .collect()) # Nothing runs until collect()

# ❌ BAD: Converting to pandas just to check .head()
# df.to_pandas().head()

# ✅ GOOD: Polars has its own fast .head() and rich printing
print(df.head())

Expression API Deep Dive

Selection and Transformation

df.select([
    pl.col("name"),
    pl.col("price") * 1.2, # Scalar math
    pl.col("category").str.to_uppercase(), # String methods
    pl.col("date").dt.year().alias("year") # Date methods
])

Filtering

# Multiple conditions
# & binds tighter than |, so group explicitly: (cheap AND active) OR category in A/B
df.filter(
    ((pl.col("price") < 100) & (pl.col("status") == "active"))
    | pl.col("category").is_in(["A", "B"])
)

Aggregation and Grouping

High-Performance Stats

results = df.group_by("department").agg([
    pl.col("salary").sum(),
    pl.col("salary").max().alias("max_pay"),
    pl.col("name").n_unique().alias("unique_employees"),
    # Advanced: conditional aggregation inside group
    pl.col("salary").filter(pl.col("role") == "manager").mean().alias("manager_avg")
])

Joins and Concatenation

SQL-like operations

# Joins: 'inner', 'left', 'right', 'full' (formerly 'outer'), 'semi', 'anti', 'cross'
df_joined = df_a.join(df_b, on="id", how="left")

# As-of join (for time-series alignment); both frames must be sorted by "timestamp"
df_aligned = df_trades.join_asof(df_quotes, on="timestamp", by="symbol")

# Concatenation
df_stacked = pl.concat([df1, df2], how="vertical")

Reshaping (Pivot and Unpivot)

# Pivot (the columns= argument was renamed to on= in Polars 1.0)
pivoted = df.pivot(on="region", index="date", values="sales", aggregate_function="sum")

# Unpivot (called melt before Polars 1.0)
melted = df.unpivot(index="date", on=["store_a", "store_b"])

Practical Workflows

1. Large-Scale Data Cleaning Pipeline

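def clean_and_optimize(path):
    return (
        pl.scan_parquet(path)
        .drop_nulls(subset=["user_id"])
        .with_columns([
            pl.col("email").str.to_lowercase(),
            # Assumes "timestamp" is stored as text; skip this step if it is already a Datetime
            pl.col("timestamp").str.to_datetime("%Y-%m-%d %H:%M:%S"),
            (pl.col("income") / 1000).cast(pl.Float32) # Downcast for memory
        ])
        # Compare the Datetime column against a datetime literal
        .filter(pl.col("timestamp") > pl.datetime(2023, 1, 1))
        # Streaming engine; on older releases use .collect(streaming=True)
        .collect(engine="streaming")
    )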

2. Time-Series Feature Engineering

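def engineer_features(df):
    # Assumes df has a Date/Datetime column "date" and is already sorted by it
    return df.with_columns([
        # Rolling average over a 7-day window (rolling_mean_by in Polars 1.0+)
        pl.col("price").rolling_mean_by("date", window_size="7d").alias("rolling_7d"),
        # Lag: the previous row's price
        pl.col("price").shift(1).alias("prev_price"),
        # Cumulative sum within each category
        pl.col("sales").cum_sum().over("category").alias("cum_sales")
    ])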

3. Fast JSON/Log Parsing

def parse_logs(path):
    return (
        pl.scan_ndjson(path) # Read line-delimited JSON
        .select([
            "level",
            pl.col("message").str.extract(r"Error: (.*)", 1),
            pl.col("metadata").struct.field("user_id") # Access nested fields
        ])
        .collect()
    )

Performance Optimization

The Power of with_columns

Instead of creating one column at a time, use with_columns to run multiple calculations in parallel.

# All 3 columns are independent, so Polars can compute them in parallel
df = df.with_columns([
    (pl.col("a") + pl.col("b")).alias("sum"),
    (pl.col("a") * pl.col("b")).alias("prod"),
    pl.col("c").str.len_chars().alias("c_len")  # str.len_chars() counts characters
])

Column Selection via Dtypes

Rapidly apply transformations to groups of columns.

# Multiply all Float64 columns by 100 (Float32 columns are not matched by this dtype)
df = df.with_columns(
    pl.col(pl.Float64) * 100
)

Common Pitfalls and Solutions

The .apply() Trap

Python functions in .map_elements() (formerly .apply()) are slow.

# ❌ Problem: Using custom Python code
# df.select(pl.col("txt").map_elements(my_custom_func))

# ✅ Solution: Use Polars native expressions or pl.when()
df.select(
    pl.when(pl.col("score") > 50).then(pl.lit("Pass")).otherwise(pl.lit("Fail")).alias("result")
)

Memory Errors on Large Files

If you hit OOM with .collect(), you might be trying to load too much data into memory.

# ✅ Solution: 
# 1. Use .filter() early in the Lazy plan.
# 2. Use streaming: .collect(engine="streaming") (.collect(streaming=True) on older releases).
# 3. Select only the columns you need.

String vs Categorical

For low-cardinality strings (like "City" or "Gender"), use Categorical.

# ✅ Solution: Saves massive amounts of RAM and speeds up joins
df = df.with_columns(pl.col("category").cast(pl.Categorical))

Best Practices

Always use Lazy API for large files - Start with scan_csv() or scan_parquet() instead of read_csv() or read_parquet().
Build complete query plans before collecting - Let Polars optimize the entire pipeline.
Use expressions over Python functions - Leverage pl.col() expressions for maximum performance.
Specify schemas when reading CSVs - Prevents type inference overhead and errors.
Use streaming for out-of-memory datasets - Pass engine="streaming" to collect() (streaming=True on older releases).
Prefer Parquet format - Faster reads/writes and preserves type information.
Cast to Categorical for low-cardinality strings - Significant memory and performance gains.
Use with_columns for multiple transformations - Parallelizes column creation.
Filter early in lazy queries - Predicate pushdown reduces data scanned.
Avoid converting to pandas - Stay in Polars ecosystem for maximum speed.

Polars is the new gold standard for single-node data processing. By combining the safety of Rust with the flexibility of Python, it provides a seamless and incredibly fast experience for modern data science.