Access cloud storage (S3, GCS, Azure) in Python using fsspec, pyarrow.fs, or obstore. Includes DataFrame integrations (Polars, DuckDB, pandas, PyArrow), performance optimization, and patterns for incremental loading, partitioned writes, and cross-cloud copies.
Exploratory data analysis and visualization: profiling datasets, choosing appropriate charts, applying statistical tests, and creating visualizations that communicate insights effectively. Use when understanding data structure, exploring distributions and relationships, selecting visualization libraries, or producing analysis-ready charts.
Data quality validation and observability for data pipelines. Combines Great Expectations and Pandera for data validation with OpenTelemetry and Prometheus for monitoring and alerting.
File formats and lakehouse table formats for data lakes: Parquet, Arrow, Lance, Zarr, Avro, ORC, Delta Lake, Apache Iceberg, and Apache Hudi. Covers compression, partitioning, ACID transactions, schema evolution, and format selection.
AI/ML production workflows: embedding generation, vector storage, RAG patterns, LLM monitoring, and batch inference.
Feature engineering for machine learning: encoding categorical variables, scaling numeric features, datetime transformations, text features, and leakage-safe preprocessing pipelines. Use when preparing data for modeling or improving model performance through better representations.
Model evaluation and validation: cross-validation strategies, metric selection, hyperparameter tuning, experiment tracking, and model comparison. Use when assessing model performance, diagnosing issues, selecting models, or optimizing hyperparameters.
Create and manage data pipelines using the FlowerPower framework with Hamilton DAGs and uv. Lightweight orchestration for batch ETL, data transformation, and ML pipelines. Integrates with Delta Lake, DuckDB, Polars, and cloud storage.