Jeden Skill in Manus ausführen
mit einem Klick

Jeden Skill in Manus mit einem Klick ausführen

pandas-performance

Sterne16

Forks2

Aktualisiert1. Februar 2026 um 04:41

Advanced sub-skill for pandas focused on memory optimization, execution speed, and handling large-scale datasets (10M+ rows). Covers low-level dtypes, efficient indexing, and vectorization of complex logic.

Installation

Mit Codex oder Claude installieren Kopieren Sie diesen Prompt, fügen Sie ihn in Codex, Claude oder einen anderen Assistant ein und lassen Sie die Skill-Seite prüfen und installieren.

In Manus ausführen

Quelle

tondevrel

tondevrel/scientific-agent-skills

GitHub-Repository öffnen Creator-Repositorys ansehen

Download

In Manus ausführen

Verwandte BerufeSOC

Basierend auf der SOC-Berufsklassifikation

DatenwissenschaftlerInformatik- und Mathematikberufe·SOC 15-2051

SKILL.md

readonly

Mehr aus diesem Repository

gleiches Repository

ase

tondevrel/scientific-agent-skills

Atomic Simulation Environment - a set of tools for setting up, manipulating, running, visualizing, and analyzing atomistic simulations. Acts as a universal interface between Python and numerous quantum chemical and molecular dynamics codes. Use for building atomic structures, geometry optimization, molecular dynamics simulations, transition state searches (NEB), file format conversion (CIF, XYZ, POSCAR, PDB), electronic property calculations (DOS, band structures), and automating simulation workflows with DFT/MD codes like VASP, GPAW, Quantum ESPRESSO, LAMMPS.

2026-02-0116

astropy

tondevrel/scientific-agent-skills

The core library for Astronomy and Astrophysics in Python. Provides data structures for coordinates, time, units, FITS files, and cosmological models. Essential for observational data reduction and theoretical astrophysics. Use when working with astronomical coordinates (RA/Dec), physical units, FITS files, time scales, WCS, cosmology, or astronomical tables.

2026-02-0116

chempy

tondevrel/scientific-agent-skills

A Python package useful for chemistry (mainly physical/analytical/inorganic chemistry). Features include balancing chemical reactions, chemical kinetics (ODE integration), chemical equilibria, ionic strength calculations, and unit handling. Use when working with chemical equations, reaction balancing, kinetic modeling, equilibrium calculations, speciation, pH calculations, ionic strength, activity coefficients, or chemical formula parsing.

2026-02-0116

cobrapy

tondevrel/scientific-agent-skills

Constraints-Based Reconstruction and Analysis for Python. Used for modeling large-scale metabolic networks in microorganisms.

2026-02-0116

dask-optimization

tondevrel/scientific-agent-skills

Advanced sub-skill for Dask focused on distributed system performance, memory management, and task graph optimization. Covers cluster tuning, efficient serialization, data skew mitigation, and dashboard-driven debugging.

2026-02-0116

dask

tondevrel/scientific-agent-skills

A flexible library for parallel computing in Python. It scales Python libraries like NumPy, pandas, and scikit-learn to multi-core systems or distributed clusters. Features lazy evaluation and task scheduling for data that exceeds RAM capacity. Use for out-of-core computing, parallel processing, distributed computing, large-scale data analysis, dask.array, dask.dataframe, dask.delayed, dask.bag, task scheduling, lazy evaluation, and scaling beyond memory limits.

2026-02-0116

name	pandas-performance
description	Advanced sub-skill for pandas focused on memory optimization, execution speed, and handling large-scale datasets (10M+ rows). Covers low-level dtypes, efficient indexing, and vectorization of complex logic.
version	2.2
license	BSD-3-Clause

pandas - Performance & Memory Management

Standard pandas code is often memory-hungry and slow. This sub-skill provides the techniques to make pandas 10x faster and use 5x less RAM by understanding its internal architecture (BlockManager and Arrow backend).

When to Use

Your DataFrame is larger than 1GB and causes RAM pressure.
pd.read_csv is taking too long to load data.
Row-wise operations (apply, iterrows) are creating bottlenecks.
You need to perform complex joins or lookups on millions of rows.
Preparing data for high-performance ML models.

Reference Documentation

Official Performance Guide: https://pandas.pydata.org/docs/user_guide/enhancingperf.html
Scaling to Large Data: https://pandas.pydata.org/docs/user_guide/scale.html
Search patterns: df.memory_usage, pd.to_numeric(downcast=...), pd.Categorical, DataFrame.eval()

Core Principles

RAM is the Bottleneck

Pandas usually creates copies of data during operations. To handle large data, you must minimize copies and use the most efficient bit-width for your data types.

Vectorization vs. Loops

Level 1 (Best): Built-in NumPy/Pandas vectorized functions.
Level 2 (Good): df.eval() or df.query() for complex math.
Level 3 (Average): np.vectorize or df.apply() (only if logic is complex).
Level 4 (Worst): iterrows() or itertuples().

Memory Optimization Patterns

1. The "Downcasting" Workflow

Standard integer and float columns use 64 bits by default. Most scientific data fits in 16 or 32 bits.

import pandas as pd
import numpy as np

def optimize_memory(df):
    start_mem = df.memory_usage().sum() / 1024**2
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
            else:
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
        else:
            # Convert low-cardinality strings to Categorical
            num_unique = df[col].nunique()
            if num_unique / len(df) < 0.5:
                df[col] = df[col].astype('category')
                
    end_mem = df.memory_usage().sum() / 1024**2
    print(f'Memory reduced by {100 * (start_mem - end_mem) / start_mem:.1f}%')
    return df

2. Modern PyArrow Backend (Pandas 2.0+)

Use the Arrow backend for massive speedups in string operations and faster loading.

# Load with Arrow engine for 2-3x speedup
df = pd.read_csv("data.csv", engine="pyarrow", dtype_backend="pyarrow")

Speed Optimization Patterns

1. Vectorizing Complex "If-Else" (Instead of .apply)

Instead of calling a Python function for every row:

# ❌ SLOW:
# df['status'] = df.apply(lambda x: 'High' if x['val'] > 100 else 'Low', axis=1)

# ✅ FAST:
df['status'] = np.where(df['val'] > 100, 'High', 'Low')

# ✅ FAST (Multiple conditions):
conditions = [
    (df['val'] > 100),
    (df['val'] > 50) & (df['val'] <= 100),
    (df['val'] <= 50)
]
choices = ['High', 'Medium', 'Low']
df['status'] = np.select(conditions, choices, default='Unknown')

2. High-Speed Lookups

If you need to map values from a dictionary/other table millions of times:

# ❌ SLOW: df.merge() or df['id'].map(large_dict)
# ✅ FAST: Use a Series as a lookup table with index
lookup_table = pd.Series(data=values, index=keys)
result = lookup_table.reindex(df['target_ids']).values

Efficient I/O

1. Parquet with Filtering (Predicate Pushdown)

Never use CSV for large data storage. Use Parquet.

# Save as partitioned parquet
df.to_parquet('data_dir', partition_cols=['year', 'month'])

# Load only specific columns and rows (Fast)
df_subset = pd.read_parquet('data_dir', columns=['price', 'id'], 
                            filters=[('year', '==', 2023)])

2. Chunking for Memory-Limited Systems

If the file is 50GB and you have 16GB RAM:

# Process in chunks of 100k rows
chunk_size = 100_000
for chunk in pd.read_csv("massive.csv", chunksize=chunk_size):
    # Perform aggregation
    summary = chunk.groupby('id')['value'].sum()
    # Save or update a running total

Critical Rules for Performance

✅ DO

Use In-place operations sparingly - Contrary to myth, inplace=True often creates internal copies anyway. Focus on dtypes instead.
Sort Index for Slicing - If you slice a large DataFrame by index, ensure it is sorted: df.sort_index(inplace=True). This turns an O(N) operation into O(log N).
Use pd.to_datetime with format - Specifying the format (%Y-%m-%d) is much faster than automatic parsing.
Leverage .eval() - For complex arithmetic like (A + B) / (C * D), df.eval() is faster and more memory-efficient as it uses numexpr.

❌ DON'T

Never iterate with iterrows() - It converts each row into a Series object, which is incredibly slow.
Avoid object dtypes - Any column with object dtype (usually strings) is a pointer to a Python object, which is memory-intensive. Use category or string[pyarrow].
Don't use append() in a loop - It creates a full copy of the DataFrame every time. Collect data in a list and use pd.concat().

Anti-Patterns (NEVER)

# ❌ BAD: Growing a DataFrame row by row
df = pd.DataFrame()
for data in large_source:
    df = pd.concat([df, pd.DataFrame([data])]) # ❌ Disaster for performance!

# ✅ GOOD: List of dicts to DataFrame
data_list = []
for data in large_source:
    data_list.append(data)
df = pd.DataFrame(data_list)

# ❌ BAD: Manual string formatting
# df['name'].apply(lambda x: f"USER_{x}")

# ✅ GOOD: Vectorized string accessor
df['name'] = "USER_" + df['name'].astype(str)

Practical Workflows

1. Identifying Memory Hogs

# Get detailed memory breakdown (including object overhead)
print(df.memory_usage(deep=True))

# Identify columns with too many unique strings (bad for 'category')
for col in df.select_dtypes(include=['object']):
    print(f"{col}: {df[col].nunique() / len(df):.2%}")

2. Fast Deduplication of 10M+ Rows

# Using sorting + shift is often faster than drop_duplicates
df = df.sort_values(['id', 'timestamp'])
mask = (df['id'] != df['id'].shift())
df_unique = df[mask]

3. Merging with Multi-Index

# If you join on multiple columns, setting them as an index 
# and using join() can be 5x faster than merge()
df1.set_index(['key1', 'key2'], inplace=True)
df2.set_index(['key1', 'key2'], inplace=True)
result = df1.join(df2, how='inner')

This sub-skill turns pandas from a prototyping tool into a high-performance engine capable of handling industrial-scale scientific data.