| name | accessing-cloud-storage |
| description | Access cloud storage (S3, GCS, Azure) in Python using fsspec, pyarrow.fs, or obstore. Includes DataFrame integrations (Polars, DuckDB, Pandas, PyArrow), performance optimization, patterns for incremental loading, partitioned writes, and cross-cloud copy. |
| dependsOn | ["@building-data-pipelines","@assuring-data-pipelines","@designing-data-storage"] |
Accessing Cloud Storage
Comprehensive guide to accessing cloud storage (S3, GCS, Azure) and remote filesystems in Python. Covers three major libraries - fsspec, pyarrow.fs, and obstore - and their integration with data engineering tools.
Quick Comparison
| Feature | fsspec | pyarrow.fs | obstore |
|---|---|---|---|
| Best For | Broad compatibility, ecosystem integration | Arrow-native workflows, Parquet | High-throughput, performance-critical |
| Backends | S3, GCS, Azure, HTTP, FTP, 20+ more | S3, GCS, HDFS, local | S3, GCS, Azure, local |
| Performance | Good (with caching) | Excellent for Parquet | Up to 9x faster than fsspec for concurrent ops |
| Dependencies | Backend-specific (s3fs, gcsfs) | Bundled with PyArrow | Zero Python deps (Rust) |
| Async Support | Yes (aiohttp) | Limited | Native sync/async |
| DataFrame Integration | Universal | PyArrow-native | Via fsspec wrapper |
| Maturity | Very mature (2018+) | Mature | New (2025), rapidly evolving |
When to Use Which?
Use fsspec when:
- You need broad ecosystem compatibility (pandas, xarray, Dask)
- Working with multiple storage backends (S3, GCS, Azure, HTTP)
- You need protocol chaining and caching features
- Your workflow involves diverse data formats beyond Parquet
Use pyarrow.fs when:
- Your pipeline is Arrow/Parquet-native
- You need zero-copy integration with PyArrow datasets
- Predicate pushdown and column pruning are critical
- Working with partitioned Parquet datasets
Use obstore when:
- Performance is paramount (many small files, high concurrency)
- You need async/await support for concurrent operations
- You want minimal dependencies (Rust-based)
- Working with large-scale data ingestion/egestion
Skill Dependencies
Prerequisites:
- @building-data-pipelines - Polars, DuckDB, PyArrow basics
- AWS, GCP, Azure auth patterns (see Authentication section below)
- @designing-data-storage - File formats (Parquet, Arrow, Lance) and lakehouse formats (Delta Lake, Iceberg, Hudi)
Related:
- @orchestrating-data-pipelines - dbt with cloud storage
Detailed Guides
Library Deep Dives
This skill contains detailed guidance for all three libraries:
DataFrame Integrations
- Polars - Native s3://, gs://, az:// URIs with lazy evaluation and predicate pushdown
- DuckDB - HTTPFS extension for SQL queries directly on remote Parquet/JSON/CSV
- Pandas - fsspec auto-detection for transparent cloud URI handling
- PyArrow - Native filesystem with dataset scanning and batch processing
For detailed patterns, see DataFrame Integration below. For Delta Lake and Iceberg table formats on cloud storage:
@designing-data-storage - Delta Lake and Iceberg with cloud catalogs (S3/GCS/Azure)
Infrastructure Patterns
- AWS, GCP, Azure auth patterns, IAM roles, service principals (see Authentication section below)
- See performance.md in this skill - Caching, concurrency, async
- See patterns.md in this skill - Incremental loading, partitioned writes, cross-cloud copy
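Incremental loading usually comes down to remembering a watermark (the newest modification time already processed) and selecting only objects newer than it on the next run. The sketch below illustrates that idea with a plain in-memory listing; `ObjectInfo` and `select_new_objects` are hypothetical names introduced for illustration, and a real pipeline would build the listing from whichever library it uses and persist the watermark between runs.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ObjectInfo:
    """Hypothetical stand-in for one entry of a bucket listing."""
    path: str
    last_modified: datetime


def select_new_objects(listing, watermark):
    """Return objects modified after the watermark (oldest first) and the new watermark."""
    fresh = sorted(
        (obj for obj in listing if obj.last_modified > watermark),
        key=lambda obj: obj.last_modified,
    )
    # If nothing is new, the watermark stays where it was.
    new_watermark = fresh[-1].last_modified if fresh else watermark
    return fresh, new_watermark


listing = [
    ObjectInfo("data/2024/01/a.parquet", datetime(2024, 1, 5, tzinfo=timezone.utc)),
    ObjectInfo("data/2024/02/b.parquet", datetime(2024, 2, 9, tzinfo=timezone.utc)),
    ObjectInfo("data/2024/03/c.parquet", datetime(2024, 3, 1, tzinfo=timezone.utc)),
]
watermark = datetime(2024, 2, 1, tzinfo=timezone.utc)
fresh, watermark = select_new_objects(listing, watermark)
print([obj.path for obj in fresh])
```

Running the selection a second time with the returned watermark yields nothing, which is what makes repeated runs safe.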
Storage Formats
@designing-data-storage - Parquet, Arrow/Feather, Lance, Zarr, Avro, ORC
Quick Start Example
Library Approaches
import fsspec
import obstore as obs
import pyarrow.fs as fs
import pyarrow.parquet as pq
from obstore.store import S3Store

# fsspec: generic file-like access
s3_fs = fsspec.filesystem('s3')
with s3_fs.open('s3://bucket/data.parquet', 'rb') as f:
    data = f.read()

# pyarrow.fs: read Parquet straight into an Arrow table (path has no s3:// scheme)
s3_pa = fs.S3FileSystem(region='us-east-1')
table = pq.read_table("bucket/data.parquet", filesystem=s3_pa)

# obstore: fetch the raw bytes of one object
store = S3Store(bucket='my-bucket', region='us-east-1')
data = obs.get(store, 'data.parquet').bytes()
DataFrame Approaches
import polars as pl
import duckdb
df = pl.read_parquet("s3://bucket/data.parquet")
lazy_df = pl.scan_parquet("s3://bucket/dataset/**/*.parquet")
con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
df = con.sql("SELECT * FROM read_parquet('s3://bucket/data.parquet')").pl()
Library Guides
fsspec Library Guide
fsspec provides a unified API for local and remote filesystems, integrating seamlessly with pandas, xarray, Dask, and many other Python data tools.
Installation
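pip install fsspec                   # core only
pip install "fsspec[s3]"             # S3 via s3fs
pip install "fsspec[gcs]"            # GCS via gcsfs
pip install "fsspec[s3,gcs,abfs]"    # S3, GCS, and Azure (adlfs)
pip install s3fs gcsfs adlfs         # or install the backends directly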
Basic Usage
import fsspec
import pandas as pd

# List every protocol fsspec knows about
print(fsspec.available_protocols())

# Create filesystem instances
local_fs = fsspec.filesystem('file')
s3_fs = fsspec.filesystem('s3', anon=False)
gcs_fs = fsspec.filesystem('gcs')

# Common filesystem operations
s3_fs.ls('my-bucket/data/')
s3_fs.exists('s3://my-bucket/data/file.csv')
s3_fs.mkdir('my-bucket/new-folder')

# Read raw bytes
with s3_fs.open('s3://my-bucket/data/file.txt', 'rb') as f:
    content = f.read()

# Stream a gzip-compressed CSV into pandas
with s3_fs.open('s3://my-bucket/data/large.csv.gz', 'rb') as f:
    df = pd.read_csv(f, compression='gzip')
Protocol Chaining & Caching
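import fsspec
import pandas as pd

# Download once into a local cache directory; open_local returns the local path
cached_file = fsspec.open_local(
    "simplecache::s3://my-bucket/large-file.nc",
    simplecache={'cache_storage': '/tmp/fsspec_cache', 'compression': None}
)

# Cache an HTTP download and decompress on read. Compression is a keyword
# argument of fsspec.open, not a link in the "::" protocol chain.
with fsspec.open(
    "simplecache::https://example.com/data.csv.gz",
    compression='gzip'
) as f:
    df = pd.read_csv(f)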
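To make the chained-URL syntax above easier to read: each `::` separates one layer, the leftmost layer wraps everything to its right, and the rightmost entry is the actual storage target. The helper below is a simplified illustration of that decomposition only; `split_chain` is a hypothetical name and this is not fsspec's own parser.

```python
def split_chain(url):
    """Split an fsspec-style chained URL into (protocol, path) layers, outermost first."""
    layers = []
    for part in url.split("::"):
        protocol, sep, path = part.partition("://")
        # A bare layer name such as "simplecache" carries no path of its own.
        layers.append((protocol, path if sep else ""))
    return layers


layers = split_chain("simplecache::s3://my-bucket/large-file.nc")
print(layers)
```

Here the cache layer comes first and the S3 object it wraps comes last, which mirrors how options are passed: keyword arguments named after a layer (for example `simplecache={...}`) configure that layer.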
Advanced S3 Features
import asyncio

import s3fs

# Explicit credentials plus botocore client and config options
fs = s3fs.S3FileSystem(
    key='AKIA...',
    secret='...',
    token='...',
    client_kwargs={
        'region_name': 'us-east-1',
        'endpoint_url': 'https://s3-compatible.local',
    },
    config_kwargs={
        'max_pool_connections': 50,
        'retries': {'max_attempts': 5}
    },
    skip_instance_cache=True
)

# Fetch several objects concurrently with the async API
async def read_multiple():
    async_fs = s3fs.S3FileSystem(asynchronous=True)
    session = await async_fs.set_session()
    try:
        return await asyncio.gather(
            async_fs._cat_file('bucket/file1.parquet'),
            async_fs._cat_file('bucket/file2.parquet'),
            async_fs._cat_file('bucket/file3.parquet')
        )
    finally:
        # Close the underlying client session when done
        await session.close()

# Bulk operations on the synchronous filesystem
fs.find('my-bucket', prefix='data/2024')
fs.du('my-bucket/data')
fs.rm('my-bucket/temp/', recursive=True)
When to Use fsspec
Choose fsspec when:
- You need broad ecosystem compatibility (pandas, xarray, Dask)
- Working with multiple storage backends (S3, GCS, Azure, HTTP)
- You need protocol chaining and caching features
- Your workflow involves diverse data formats beyond Parquet
Performance Considerations
- ✅ Use filecache:: instead of simplecache:: for persistent caching across sessions
- ✅ Increase max_pool_connections for high concurrency
- ✅ Use async API for many concurrent small file operations
- ⚠️ For pure Parquet workflows with high throughput, consider pyarrow.fs instead
- ⚠️ For maximum performance on large concurrent operations, consider obstore
PyArrow Filesystem Guide
PyArrow provides native filesystem integration optimized for Arrow and Parquet workflows.
Installation
pip install pyarrow
Basic Usage
import pyarrow.dataset as ds
import pyarrow.fs as fs
import pyarrow.parquet as pq

# Create filesystem instances
s3_fs = fs.S3FileSystem(region='us-east-1')
gcs_fs = fs.GcsFileSystem()
local_fs = fs.LocalFileSystem()

# Read a single file, loading only the needed columns
table = pq.read_table(
    "bucket/data.parquet",
    filesystem=s3_fs,
    columns=["id", "value"]
)

# Open a Hive-partitioned dataset
dataset = ds.dataset(
    "bucket/dataset/",
    filesystem=s3_fs,
    partitioning=ds.HivePartitioning.discover()
)

# Filter on partition and data columns while scanning
table = dataset.to_table(
    filter=(ds.field("year") == 2024) & (ds.field("value") > 100),
    columns=["id", "value"]
)
When to Use pyarrow.fs
Choose pyarrow.fs when:
- Your pipeline is Arrow/Parquet-native
- You need zero-copy integration with PyArrow datasets
- Predicate pushdown and column pruning are critical
- Working with partitioned Parquet datasets
Performance Considerations
- ✅ Excellent for Parquet workflows with high throughput
- ✅ Zero-copy data transfer with Arrow-native tools
- ✅ Efficient predicate pushdown and column pruning
- ⚠️ Limited async support compared to obstore
- ⚠️ Fewer protocol options than fsspec
obstore Library Guide
obstore is a high-performance Rust-based library for cloud storage access with native async support.
Installation
pip install obstore
Basic Usage
import obstore as obs
from obstore.store import S3Store, GCSStore, AzureStore

# One store per bucket or container
s3_store = S3Store(bucket='my-bucket', region='us-east-1')
gcs_store = GCSStore(bucket='my-bucket')
azure_store = AzureStore(container='my-container', account='myaccount')

# Download one object as bytes
data = obs.get(s3_store, 'path/to/file.parquet').bytes()

# list() yields batches (lists) of object metadata, so iterate twice
for batch in obs.list(s3_store, prefix='data/2024'):
    for obj in batch:
        print(obj["path"], obj["size"])

# Upload a bytes-like payload (here, the bytes fetched above)
data_bytes = bytes(data)
obs.put(s3_store, 'output/data.parquet', data_bytes)
Async Operations
import asyncio

import obstore as obs
from obstore.store import S3Store

async def fetch_multiple():
    store = S3Store(bucket='my-bucket', region='us-east-1')
    # Issue all GET requests concurrently
    results = await asyncio.gather(
        obs.get_async(store, 'file1.parquet'),
        obs.get_async(store, 'file2.parquet'),
        obs.get_async(store, 'file3.parquet')
    )
    # Each result is a GetResult; use `await result.bytes_async()` to read its body
    return results

results = asyncio.run(fetch_multiple())
When to Use obstore
Choose obstore when:
- Performance is paramount (many small files, high concurrency)
- You need async/await support for concurrent operations
- You want minimal dependencies (Rust-based)
- Working with large-scale data ingestion/egestion
Performance Considerations
- ✅ Up to 9x faster than fsspec for concurrent operations
- ✅ Native sync/async support
- ✅ Zero Python dependencies
- ✅ Rust-based implementation
- ⚠️ Newer library (2025), rapidly evolving
- ⚠️ Smaller ecosystem than fsspec
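When fetching many small objects concurrently, it is often worth capping the number of requests in flight so that a listing of thousands of keys does not open thousands of connections at once. The sketch below shows one common way to do that with `asyncio.Semaphore`. It is library-agnostic: `fetch_all` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real async download call such as the `get_async` pattern shown earlier.

```python
import asyncio


async def fetch_all(paths, fetch, max_concurrency=8):
    """Fetch every path concurrently with at most max_concurrency requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(path):
        # Wait for a free slot before starting the request.
        async with semaphore:
            return await fetch(path)

    # gather returns results in the same order as the input paths.
    return await asyncio.gather(*(bounded(p) for p in paths))


async def fake_fetch(path):
    """Hypothetical stand-in for a real async download."""
    await asyncio.sleep(0.01)
    return f"contents of {path}".encode()


paths = [f"file{i}.parquet" for i in range(20)]
results = asyncio.run(fetch_all(paths, fake_fetch, max_concurrency=4))
print(len(results))
```

A reasonable starting value for `max_concurrency` depends on object size and the client's connection pool; treat the numbers here as placeholders to tune.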
DataFrame Integration
DataFrame libraries provide high-level abstractions for cloud storage I/O. This section covers integration patterns for Polars, DuckDB, Pandas, and PyArrow.
Quick Comparison
| Framework | Integration Approach | Best For |
|---|---|---|
| Polars | Native cloud URIs (s3://) + fsspec/PyArrow bridges | High-performance, lazy evaluation |
| DuckDB | HTTPFS extension + SQL interface | Analytical queries, SQL workflows |
| Pandas | fsspec auto-detection | Simple workflows, broad compatibility |
| PyArrow | Native filesystem + dataset scanning | Arrow-native pipelines, batch processing |
When to Use Which?
- Polars: Best for high-performance data pipelines with lazy evaluation, predicate pushdown, and memory efficiency
- DuckDB: Best for SQL-centric workflows, analytical queries on remote data without loading into memory
- Pandas: Best for simple scripts, small-to-medium data, maximum ecosystem compatibility
- PyArrow: Best for Arrow-native workflows, batch processing, and as a foundation for other tools
Polars
Polars provides native cloud storage support via the Rust object_store crate, plus fsspec and PyArrow integration for broader compatibility.
Key approaches:
- Native URIs: Direct s3://, gs://, az:// support (recommended)
- fsspec bridge: For protocol chaining and caching
- PyArrow dataset: For Hive-partitioned datasets with complex pushdown
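import polars as pl

# Eager read from a native cloud URI
df = pl.read_parquet("s3://bucket/data.parquet")

# Lazy scan: the filter and column selection are pushed down into the scan
lazy_df = pl.scan_parquet("s3://bucket/dataset/**/*.parquet")
result = (
    lazy_df
    .filter(pl.col("date") > "2024-01-01")
    .select(["id", "value"])
    .collect()
)

# Write a single file
df.write_parquet("s3://bucket/output/data.parquet")

# Partitioned write through PyArrow: partition columns go in pyarrow_options
df.write_parquet(
    "s3://bucket/output/",
    use_pyarrow=True,
    pyarrow_options={"partition_cols": ["year", "month"]}
)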
fsspec bridge for caching:
import fsspec
import polars as pl

# Open through fsspec's simplecache layer and hand Polars the file object
with fsspec.open("simplecache::s3://bucket/cached.parquet", "rb") as f:
    df = pl.read_parquet(f)
See: @building-data-pipelines for Polars fundamentals.
DuckDB
DuckDB's HTTPFS extension enables direct SQL queries on remote files without loading entire datasets into memory.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("SET s3_region='us-east-1';")

# Aggregate remote Parquet files and return a Polars DataFrame
df = con.sql("""
    SELECT category, SUM(value) AS total
    FROM read_parquet('s3://bucket/data/*.parquet')
    WHERE date >= '2024-01-01'
    GROUP BY category
""").pl()

# Write query results back to S3 (my_table must already exist in this connection)
con.sql("""
    COPY (SELECT * FROM my_table)
    TO 's3://bucket/output.parquet'
    (FORMAT PARQUET)
""")
Environment-based auth (recommended):
import os
os.environ['AWS_REGION'] = 'us-east-1'
Delta Lake integration:
con.execute("INSTALL delta; LOAD delta;")
df = con.sql("SELECT * FROM delta_scan('s3://bucket/delta-table/')").pl()
See: @building-data-pipelines for DuckDB fundamentals.
Pandas
Pandas leverages fsspec for automatic cloud URI handling, making remote files transparent to use.
import fsspec
import pandas as pd

# Cloud URIs are resolved through fsspec automatically
df = pd.read_parquet("s3://bucket/data.parquet")
df = pd.read_csv("s3://bucket/data.csv.gz")  # compression inferred from the extension

# Explicit filesystem with column pruning and row filters
fs = fsspec.filesystem("s3")
df = pd.read_parquet(
    "s3://bucket/data.parquet",
    filesystem=fs,
    columns=["id", "value"],
    filters=[("date", ">=", "2024-01-01")]
)

# Partitioned write
df.to_parquet(
    "s3://bucket/output/",
    partition_cols=["year", "month"],
    filesystem=fs
)
PyArrow filesystem for better performance:
import pyarrow.fs as fs
s3_fs = fs.S3FileSystem(region="us-east-1")
df = pd.read_parquet("bucket/data.parquet", filesystem=s3_fs)
See: @building-data-pipelines for pandas alternatives (Polars recommended for large data).
PyArrow
PyArrow provides the foundation for many DataFrame libraries with native filesystem integration and efficient dataset scanning.
import pyarrow.dataset as ds
import pyarrow.fs as fs
import pyarrow.parquet as pq

s3_fs = fs.S3FileSystem(region="us-east-1")

# Read a single file, loading only the needed columns
table = pq.read_table(
    "bucket/file.parquet",
    filesystem=s3_fs,
    columns=["id", "value"]
)

# Open a Hive-partitioned dataset and filter while scanning
dataset = ds.dataset(
    "bucket/dataset/",
    filesystem=s3_fs,
    partitioning=ds.HivePartitioning.discover()
)
table = dataset.to_table(
    filter=(ds.field("year") == 2024) & (ds.field("value") > 100),
    columns=["id", "value"]
)

# Stream record batches instead of materializing the whole table
scanner = dataset.scanner(
    filter=ds.field("value") > 0,
    batch_size=65536
)
for batch in scanner.to_batches():
    process(batch)  # process() is a placeholder for your own batch handler
fsspec bridge:
import fsspec
import pyarrow.parquet as pq

# Open the object through fsspec and pass the file handle to PyArrow
fs = fsspec.filesystem("s3")
with fs.open("s3://bucket/file.parquet", "rb") as f:
    table = pq.read_table(f)
See: @building-data-pipelines for PyArrow fundamentals.
Format Considerations
For detailed information on storage formats (Parquet, Arrow, Lance, Zarr, Avro, ORC) and lakehouse table formats (Delta Lake, Iceberg, Hudi), including compression, schema evolution, and format selection guidance, see @designing-data-storage. This section focuses on I/O patterns, not format internals.
Authentication
All three libraries follow standard cloud authentication patterns: explicit credentials → environment variables → config files → IAM roles/Managed Identities.
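The resolution order above can be pictured as a chain of fallbacks in which the first source that yields credentials wins. The sketch below illustrates that order for AWS-style credentials using only the standard library. `resolve_aws_credentials` is a hypothetical helper written for illustration; real SDKs implement a longer chain (SSO, web identity, container roles, and so on), and the final step here only reports that resolution is deferred to the platform.

```python
import configparser
import os
from pathlib import Path


def resolve_aws_credentials(key=None, secret=None, env=None, config_path=None, profile="default"):
    """Return (source, key, secret) from the first place that supplies credentials."""
    env = os.environ if env is None else env

    # 1. Explicit arguments take precedence.
    if key and secret:
        return "explicit", key, secret

    # 2. Standard AWS environment variables.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "environment", env["AWS_ACCESS_KEY_ID"], env["AWS_SECRET_ACCESS_KEY"]

    # 3. Shared credentials file in INI format.
    path = Path(config_path) if config_path else Path.home() / ".aws" / "credentials"
    parser = configparser.ConfigParser()
    if path.is_file() and parser.read(path) and parser.has_section(profile):
        section = parser[profile]
        if "aws_access_key_id" in section and "aws_secret_access_key" in section:
            return "config-file", section["aws_access_key_id"], section["aws_secret_access_key"]

    # 4. Nothing found locally: defer to the platform (IAM role or managed identity).
    return "instance-role", None, None


source, _, _ = resolve_aws_credentials(
    env={"AWS_ACCESS_KEY_ID": "example-id", "AWS_SECRET_ACCESS_KEY": "example-secret"}
)
print(source)
```

In practice, leaving credentials out of code and relying on the later steps (environment, config file, or an attached role) keeps secrets out of source control.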
Performance Optimization
Key strategies:
- Caching: fsspec's simplecache:: for repeated access
- Concurrency: obstore async API for many small files
- Predicate pushdown: Filter at storage layer using partitioning
- Column pruning: Read only required columns
See: performance.md in this skill for detailed guidance.
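Partition-based pruning works because Hive-style layouts encode column values in the object path, so whole files can be skipped before any bytes are read. The sketch below illustrates the idea on a plain list of paths. `parse_hive_partitions` and `prune` are hypothetical helpers, and the dataset APIs shown earlier perform this step internally; the code only demonstrates the principle.

```python
def parse_hive_partitions(path):
    """Extract key=value directory segments from a Hive-style object path."""
    parts = {}
    for segment in path.split("/")[:-1]:  # the last segment is the file name
        key, sep, value = segment.partition("=")
        if sep:
            parts[key] = value
    return parts


def prune(paths, **wanted):
    """Keep only paths whose partition values match every requested key."""
    return [
        p for p in paths
        if all(parse_hive_partitions(p).get(k) == str(v) for k, v in wanted.items())
    ]


paths = [
    "dataset/year=2023/month=12/part-0.parquet",
    "dataset/year=2024/month=01/part-0.parquet",
    "dataset/year=2024/month=02/part-0.parquet",
]
selected = prune(paths, year=2024)
print(selected)
```

Only the two 2024 files survive the filter, so a reader would never request the 2023 object at all.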
References