| name | code-science |
| description | Scientific programming best practices including reproducible research, computational notebooks, version control for research, data management, HPC/parallel computing, and research software engineering. Use when user needs help with research code organization, reproducibility, scientific Python/R workflows, or computational infrastructure. Triggers on "reproducible research", "research code", "scientific computing", "HPC", "parallel computing", "Jupyter", "notebook", "data management plan", "research software", "code review for science". |
Scientific Programming
Best practices for research software and reproducible computation.
Project Structure
project/
├── README.md                 # Project overview, how to reproduce
├── LICENSE                   # MIT, Apache 2.0, or GPL
├── requirements.txt          # or environment.yml (conda)
├── setup.py / pyproject.toml
├── data/
│   ├── raw/                  # Never modify raw data
│   ├── processed/            # Cleaned/transformed data
│   └── external/             # Third-party data
├── src/ or scripts/
│   ├── data_processing.py
│   ├── analysis.py
│   ├── models.py
│   └── visualization.py
├── notebooks/                # Exploratory analysis
│   ├── 01_eda.ipynb
│   ├── 02_modeling.ipynb
│   └── 03_figures.ipynb
├── results/
│   ├── figures/
│   └── tables/
├── tests/
└── docs/
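One way to bootstrap the layout above is a small scaffolding script. This is a minimal sketch: the function name `create_project_skeleton`, the choice of `src` over `scripts`, and the `.gitkeep` placeholder files are assumptions for illustration, not part of the layout itself.

```python
from pathlib import Path

# Directories from the layout above; "src" is used here, "scripts" works equally well.
PROJECT_DIRS = [
    "data/raw",
    "data/processed",
    "data/external",
    "src",
    "notebooks",
    "results/figures",
    "results/tables",
    "tests",
    "docs",
]


def create_project_skeleton(root: str) -> list[Path]:
    """Create the directory layout under root and return the directory paths."""
    root_path = Path(root)
    directories = []
    for relative in PROJECT_DIRS:
        directory = root_path / relative
        directory.mkdir(parents=True, exist_ok=True)
        # Git does not track empty directories, so add a placeholder file.
        (directory / ".gitkeep").touch()
        directories.append(directory)
    # An empty README marks where the reproduction instructions belong.
    (root_path / "README.md").touch()
    return directories
```

Because `mkdir` is called with `exist_ok=True`, running the script again on an existing project leaves existing files in place.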
Reproducibility Checklist
- Environment: Pin all dependencies with versions
pip freeze > requirements.txt
conda env export > environment.yml
- Random seeds: Set and document all random seeds
import numpy as np
import random
SEED = 42
np.random.seed(SEED)
random.seed(SEED)
- Data versioning: Use DVC or git-lfs for large data
dvc init
dvc add data/raw/dataset.csv
# dvc add writes a .dvc pointer file and a .gitignore entry; commit both
git add data/raw/dataset.csv.dvc data/raw/.gitignore
- Configuration: Separate config from code
import yaml
with open('config.yaml') as f:
    config = yaml.safe_load(f)
- Logging: Record all experiments
import logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s: %(message)s',
    filename='experiment.log',
)
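The checklist items can be tied together by writing a small provenance record at the start of every run. The sketch below uses only the standard library; the function name `record_run_metadata`, the default file name `run_metadata.json`, and the choice of fields are illustrative assumptions.

```python
import hashlib
import json
import platform
import random
import sys
from datetime import datetime, timezone


def record_run_metadata(config: dict, seed: int, path: str = "run_metadata.json") -> dict:
    """Seed the stdlib RNG and save what is needed to repeat this run."""
    random.seed(seed)
    # Hash a canonical JSON dump so identical configs always give the same digest.
    config_json = json.dumps(config, sort_keys=True)
    metadata = {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "seed": seed,
        "config": config,
        "config_sha256": hashlib.sha256(config_json.encode("utf-8")).hexdigest(),
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(metadata, f, indent=2)
    return metadata
```

This only seeds Python's `random` module. NumPy and any other library with its own generator still need to be seeded separately with the same recorded value.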
Parallel Computing
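from multiprocessing import Pool
import numpy as np

def process_chunk(data):
    # heavy_computation is a placeholder for the CPU-bound work on one chunk
    return heavy_computation(data)

if __name__ == "__main__":
    # The guard stops worker processes from re-running this block on import
    with Pool(processes=8) as pool:
        results = pool.map(process_chunk, data_chunks)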
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

# CPU-bound work: separate processes sidestep the GIL
with ProcessPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(process_func, items))

# I/O-bound work: threads suffice while tasks wait on the network or disk
with ThreadPoolExecutor(max_workers=20) as executor:
    results = list(executor.map(fetch_data, urls))
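The `Pool.map` example assumes the data has already been split into `data_chunks`. A minimal sketch of that split-and-combine step follows. The names `split_into_chunks`, `sum_of_squares`, and `parallel_sum_of_squares` are hypothetical, and the sum of squares merely stands in for a real CPU-bound computation.

```python
from concurrent.futures import ProcessPoolExecutor


def split_into_chunks(items: list, n_chunks: int) -> list[list]:
    """Split items into at most n_chunks contiguous chunks of near-equal size."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    chunk_size, remainder = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `remainder` chunks take one extra item each.
        end = start + chunk_size + (1 if i < remainder else 0)
        if end > start:
            chunks.append(items[start:end])
        start = end
    return chunks


def sum_of_squares(chunk: list):
    """Stand-in for a CPU-bound computation on one chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(items, n_chunks=4, executor_cls=ProcessPoolExecutor):
    """Map sum_of_squares over the chunks, then combine the partial results."""
    chunks = split_into_chunks(items, n_chunks)
    with executor_cls(max_workers=n_chunks) as executor:
        partial_sums = list(executor.map(sum_of_squares, chunks))
    return sum(partial_sums)
```

When the default `ProcessPoolExecutor` is used, call `parallel_sum_of_squares` from under an `if __name__ == "__main__":` guard so that platforms which start workers by re-importing the main module do not re-run the call. Passing `executor_cls=ThreadPoolExecutor` is a convenient way to exercise the same logic in a unit test.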
Performance Optimization
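import cProfile
cProfile.run('my_function()', sort='cumulative')  # profile before optimizing
# Slow: Python-level loop over each element
result = [x**2 + 2*x + 1 for x in data]
# Fast: vectorized expression (assumes data is a NumPy array)
result = data**2 + 2*data + 1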
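To capture profiler output as text rather than reading it off the console, `cProfile.Profile` can be paired with `pstats`. The sketch below assumes the function of interest can be called directly; `profile_call` and `slow_polynomial` are hypothetical names introduced for illustration.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top_n: int = 5, **kwargs):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so the most expensive call trees come first.
    stats.sort_stats("cumulative").print_stats(top_n)
    return result, stream.getvalue()


def slow_polynomial(data):
    """Element-wise x**2 + 2*x + 1 in pure Python."""
    return [x**2 + 2 * x + 1 for x in data]


result, report = profile_call(slow_polynomial, range(100_000))
print(report)
```

Saving `report` next to the experiment log makes it possible to compare timings before and after an optimization.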
Data Management
FAIR Principles
- Findable: Persistent identifiers (DOI), rich metadata
- Accessible: Open protocols, authentication when needed
- Interoperable: Standard formats (CSV, JSON, HDF5, NetCDF)
- Reusable: Clear license, provenance, community standards
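A lightweight way to work toward these principles is to store a metadata sidecar next to each dataset. The sketch below is one possible layout, not a standard: the field names, the `.meta.json` suffix, and the function names `sha256_of_file` and `write_metadata_sidecar` are assumptions.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path: Path, block_size: int = 1 << 20) -> str:
    """Compute a SHA-256 checksum without loading the whole file into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def write_metadata_sidecar(data_path: str, license_name: str, source: str) -> Path:
    """Write <data file>.meta.json with checksum, size, license, and provenance."""
    path = Path(data_path)
    metadata = {
        "file_name": path.name,
        "size_bytes": path.stat().st_size,
        "sha256": sha256_of_file(path),
        "license": license_name,  # Reusable: clear license
        "source": source,  # Reusable: provenance
    }
    sidecar = path.with_name(path.name + ".meta.json")
    sidecar.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return sidecar
```

Recomputing the checksum later and comparing it with the stored value is a cheap way to confirm that a raw data file has not been modified.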
File Formats for Science
| Format | Best For | Size | Speed |
|---|---|---|---|
| CSV | Small tabular, universal | Large | Slow |
| Parquet | Large tabular, columnar | Small | Fast |
| HDF5 | Multidimensional arrays | Small | Fast |
| NetCDF | Climate/geospatial | Small | Fast |
| FITS | Astronomy | Medium | Fast |
| Feather | DataFrame interchange | Small | Very fast |
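import h5py
import pandas as pd

# Parquet: compressed columnar storage for tabular data
df.to_parquet('data.parquet', compression='snappy')
df = pd.read_parquet('data.parquet')

# HDF5: named datasets for multidimensional arrays
with h5py.File('data.h5', 'w') as f:
    f.create_dataset('experiment1', data=array)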
Testing Scientific Code
import numpy as np
import pytest

def test_conservation_law():
    """Physical quantities should be conserved"""
    initial_energy = compute_energy(initial_state)
    final_energy = compute_energy(simulate(initial_state))
    np.testing.assert_allclose(initial_energy, final_energy, rtol=1e-6)

def test_known_solution():
    """Compare against analytical solution"""
    numerical = solve_numerically(params)
    analytical = analytical_solution(params)
    np.testing.assert_allclose(numerical, analytical, atol=1e-4)

def test_symmetry():
    """Result should be symmetric under transformation"""
    result1 = compute(data)
    result2 = compute(transform(data))
    # Floating-point results rarely match bit for bit, so compare with a tolerance
    np.testing.assert_allclose(result1, result2, rtol=1e-12)
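The tests above rely on placeholder functions such as `simulate` and `compute_energy`. For a concrete instance, the sketch below applies the conservation and known-solution checks to a unit-mass harmonic oscillator integrated with velocity Verlet. The integrator, step size, and tolerances are illustrative choices, and the standard library's `math.isclose` is used in place of `np.testing.assert_allclose` to keep the example dependency-free.

```python
import math


def simulate(x0: float, v0: float, dt: float, n_steps: int, omega: float = 1.0):
    """Integrate x'' = -omega**2 * x with the velocity Verlet scheme."""
    x, v = x0, v0
    a = -omega**2 * x
    for _ in range(n_steps):
        x += v * dt + 0.5 * a * dt**2
        a_new = -omega**2 * x
        v += 0.5 * (a + a_new) * dt
        a = a_new
    return x, v


def compute_energy(x: float, v: float, omega: float = 1.0) -> float:
    """Total energy of a unit-mass oscillator: kinetic plus potential."""
    return 0.5 * v**2 + 0.5 * omega**2 * x**2


def test_conservation_law():
    """Final energy should match the initial energy within tolerance."""
    x, v = simulate(x0=1.0, v0=0.0, dt=1e-3, n_steps=10_000)
    assert math.isclose(compute_energy(x, v), compute_energy(1.0, 0.0), rel_tol=1e-6)


def test_known_solution():
    """With x(0) = 1 and v(0) = 0 the exact solution is x(t) = cos(omega * t)."""
    dt, n_steps = 1e-3, 10_000
    x, _ = simulate(x0=1.0, v0=0.0, dt=dt, n_steps=n_steps)
    assert math.isclose(x, math.cos(dt * n_steps), abs_tol=1e-4)
```

The tolerances are tied to the integrator: velocity Verlet has a bounded energy error that scales with `dt**2`, so a larger step size would need a looser `rel_tol`.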
Tips
- Raw data is sacred: never modify it, only create processed copies
- Use version control (git) from day one
- Write the README before writing code
- Automate the full pipeline (Makefile or Snakemake)
- Document assumptions and decisions in code comments
- Use type hints for clarity in scientific code
- Publish code alongside papers (GitHub + Zenodo for DOI)
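Make decides what to rebuild by comparing the modification times of targets against their prerequisites. The core of that idea can be sketched in a few lines of Python. `is_stale` is a hypothetical helper that looks at modification times only, which is a simplification of what pipeline tools actually do.

```python
import os


def is_stale(target: str, sources: list[str]) -> bool:
    """Return True if target is missing or older than any of its sources."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in sources)
```

A pipeline script could call, for example, `is_stale("data/processed/clean.csv", ["data/raw/dataset.csv", "src/data_processing.py"])` before re-running the processing step. The file name `clean.csv` is made up for illustration. For anything beyond a handful of steps, a dedicated tool such as Make or Snakemake is the better choice.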