---
name: valency-anndata
description: Helper for the valency-anndata Python package — an AnnData-based toolkit for analyzing Polis opinion/voting data. Use when working with Polis conversations, running the Polis analysis pipeline (recipe_polis), loading datasets, preprocessing vote matrices, clustering participants, or visualizing results. Triggers on: AnnData vote matrices, Polis data, opinion clustering, val.datasets/val.preprocessing/val.tools/val.viz usage, recipe_polis, schematic diagrams, langevitour, jupyter-scatter, or any valency-anndata API call.
---
Valency AnnData
Toolkit for analyzing Polis opinion/voting data using AnnData. Follows scanpy namespace conventions.
On Skill Load
When this skill is invoked for exploring a Polis conversation, use AskUserQuestion to ask which perspective map projections the user would like to explore. PCA is always included via recipe_polis. The additional options are:
- PaCMAP (Recommended) — val.tools.pacmap() — preserves both local and global structure
- LocalMAP — val.tools.localmap() — focuses on local neighborhood structure, lighter than PaCMAP
- UMAP — val.tools.umap() — popular nonlinear projection, requires computing neighbors first
- t-SNE — val.tools.tsne() — classic nonlinear projection, good for visualization
Allow multi-select. Run recipe_polis first (PCA is always included), then compute each selected projection, running a separate k-means clustering per embedding.
After running recipe_polis, computing selected projections, and calling val.preprocessing.calculate_qc_metrics(), use a second AskUserQuestion to ask which .obs annotations the user would like to plot alongside the default cluster labels (kmeans_*). The available QC metrics are:
- pct_seen (Recommended) — fraction of statements the participant voted on
- pct_agree (Recommended) — fraction of votes that were agree
- pct_disagree — fraction of votes that were disagree
- pct_pass — fraction of votes that were pass
- mean_vote — average vote value (-1 to +1)
Allow multi-select. Then for each embedding, pass color=["kmeans_<basis>", ...selected_annotations] to val.viz.embedding().
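The color-list construction above can be sketched with a small helper (the helper itself is hypothetical — only the kmeans_<basis> naming convention comes from this skill):

```python
def embedding_color_keys(bases, selected_annotations):
    """Build the color= list for val.viz.embedding, per basis.

    Each embedding gets its own k-means key (kmeans_<basis>) followed by
    whatever QC annotations the user selected via AskUserQuestion.
    """
    return {basis: [f"kmeans_{basis}", *selected_annotations] for basis in bases}


colors = embedding_color_keys(["polis", "pacmap"], ["pct_seen", "pct_agree"])
# colors["pacmap"] -> ["kmeans_pacmap", "pct_seen", "pct_agree"]
```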
API Namespace
import valency_anndata as val
val.datasets # Load Polis conversation data
val.preprocessing # Preprocessing
val.tools # Analysis tools
val.viz # Visualization
val.scanpy # Re-exported scanpy (pp, tl, pl, get)
Data Model
Core structure: participants x statements AnnData matrix. Votes are -1/+1 with NaN for unseen.
.X — vote matrix
.obs — participant metadata + QC metrics + cluster labels
.var — statement metadata (content, is_meta, moderation_state, etc.)
.layers — intermediate matrices (X_masked, X_masked_imputed_mean)
.obsm — embeddings (X_pca_polis, X_pacmap, X_umap)
.uns — raw votes, statements, pipeline params
For full data model details, see references/data-model.md.
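A minimal numpy mock of the core matrix shape and encoding (the 0-for-pass value is an assumption inferred from the pct_pass metric; the skill only documents -1/+1 with NaN):

```python
import numpy as np

nan = np.nan
# 3 participants (rows, .obs) x 4 statements (columns, .var).
# -1/+1 votes, NaN = the participant never saw the statement.
# (0 for "pass" is an assumption, not guaranteed by the package.)
X = np.array([
    [  1, -1, nan,   0],
    [  1,  1,   1, nan],
    [nan, -1,  -1,  -1],
], dtype=float)

seen = ~np.isnan(X)                      # which votes were actually cast
votes_per_participant = seen.sum(axis=1)  # basis for QC metrics like pct_seen
```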
Loading Data
Always pass translate_to= matching the language you are currently speaking with the user, unless they ask otherwise. For example, if the conversation is in English, use translate_to="en"; if in French, use translate_to="fr".
adata = val.datasets.polis.load("https://pol.is/report/r29kkytnipymd3exbynkd", translate_to="en")
adata = val.datasets.polis.load("https://pol.is/4asymkcrjf", translate_to="en")
adata = val.datasets.polis.load("r2dxjrdwef2ybx2w9n3ja", translate_to="en")
adata = val.datasets.polis.load("https://polis.tw/report/r29kkytnipymd3exbynkd", translate_to="en")
adata = val.datasets.polis.load("/path/to/export/", translate_to="en")
adata = val.datasets.aufstehen(translate_to="en")
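load() accepts several input shapes (report URL, conversation URL, bare report id, local export directory). A purely illustrative classifier of those shapes — val.datasets.polis.load() does its own dispatch, and this helper is hypothetical:

```python
from pathlib import Path
from urllib.parse import urlparse


def classify_polis_source(source: str) -> str:
    """Rough guess at which input form a load() argument takes (illustrative only)."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        # report URLs contain /report/; other pol.is URLs point at conversations
        return "report_url" if "/report/" in parsed.path else "conversation_url"
    if Path(source).is_absolute() or source.startswith("./"):
        return "local_export"
    return "report_id"  # bare ids like "r2dxjrdwef2ybx2w9n3ja"
```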
The Polis Pipeline (recipe_polis)
End-to-end Small et al. pipeline. Run with:
val.tools.recipe_polis(adata)
Six sequential steps:
1. _zero_mask() — Masks metadata & moderated statements. Requires .var["is_meta"]. Creates .layers["X_masked"].
2. impute() — Column-mean imputation of NaN values. Creates .layers["X_masked_imputed_mean"].
3. pca() — Standard PCA on the imputed matrix. Creates .obsm["X_pca_masked_unscaled"].
4. _sparsity_aware_scaling() — Divides the PCA coordinates by sparsity scaling factors (via reddwarf). Creates .obsm["X_pca_polis"].
5. _cluster_mask() — Excludes participants with fewer than 7 votes from clustering. Creates .obs["cluster_mask"].
6. kmeans() — Silhouette-scored k-means (k=2..5). Creates .obs["kmeans_polis"].
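The first three steps can be sketched in plain numpy. How _zero_mask applies the mask internally is an assumption — this sketch simply zeroes the masked columns — and the reddwarf sparsity scaling of step 4 is omitted:

```python
import numpy as np

nan = np.nan
X = np.array([           # 5 participants x 4 statements, NaN = unseen
    [  1,   1,  -1, nan],
    [  1,  -1, nan,   1],
    [ -1,   1,   1,   1],
    [nan,  -1,  -1,  -1],
    [  1, nan,   1,  -1],
], dtype=float)
is_meta = np.array([True, False, False, False])  # statement 0 is metadata

# _zero_mask: zero out metadata/moderated statement columns (assumed behavior)
X_masked = X.copy()
X_masked[:, is_meta] = 0.0                        # analogue of .layers["X_masked"]

# impute: column-mean imputation of the remaining NaNs
col_means = np.nanmean(X_masked, axis=0)
X_imp = np.where(np.isnan(X_masked), col_means, X_masked)  # X_masked_imputed_mean

# pca: top-2 principal components via SVD of the centered matrix
Xc = X_imp - X_imp.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_pca = Xc @ Vt[:2].T                             # X_pca_masked_unscaled analogue
```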
Key parameters:
val.tools.recipe_polis(
adata,
participant_vote_threshold=7,
key_added_pca="X_pca_polis",
key_added_kmeans="kmeans_polis",
inplace=True,
)
Custom pipelines can import helper steps directly:
from valency_anndata.tools._polis import _zero_mask, _cluster_mask, _sparsity_aware_scaling
Statement Clustering (recipe_polis2)
LLM-based statement clustering (requires pip install valency-anndata[polis2]):
val.tools.recipe_polis2_statements(adata)
Preprocessing
val.preprocessing.impute(adata, strategy="mean", source_layer="X_masked", target_layer="X_masked_imputed_mean")
val.preprocessing.calculate_qc_metrics(adata, inplace=True)
val.preprocessing.rebuild_vote_matrix(adata, trim_rule=1.0, inplace=True)
val.preprocessing.neighbors(adata, ...)
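calculate_qc_metrics produces the .obs columns listed earlier; here is a numpy sketch of plausible definitions matching those descriptions (the +1 = agree and 0 = pass sign conventions are assumptions, not guaranteed by the package):

```python
import numpy as np

nan = np.nan
X = np.array([           # 2 participants x 4 statements
    [1, -1,   0, nan],
    [1,  1, nan, nan],
], dtype=float)

seen = ~np.isnan(X)
n_seen = seen.sum(axis=1)
pct_seen = n_seen / X.shape[1]                 # fraction of statements voted on
pct_agree = (X == 1).sum(axis=1) / n_seen      # fraction of cast votes that agree
pct_disagree = (X == -1).sum(axis=1) / n_seen
pct_pass = (X == 0).sum(axis=1) / n_seen
mean_vote = np.nanmean(X, axis=1)              # average vote value, -1..+1
```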
Tools (Beyond recipe_polis)
Embedding priority: Always run recipe_polis first (for comparison), then prefer PaCMAP and LocalMAP. Only use UMAP if explicitly requested.
Per-embedding clustering: Always run separate k-means for each embedding representation. Don't reuse kmeans_polis for PaCMAP/LocalMAP plots.
LAYER = 'X_masked_imputed_mean'
val.tools.pacmap(adata, layer=LAYER)
val.tools.localmap(adata, layer=LAYER)
val.tools.kmeans(adata, use_rep='X_pacmap', key_added='kmeans_pacmap')
val.tools.kmeans(adata, use_rep='X_localmap', key_added='kmeans_localmap')
val.tools.pca(adata, ...)
val.tools.umap(adata, ...)
val.tools.tsne(adata, ...)
val.tools.leiden(adata, ...)
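val.tools.kmeans picks k by silhouette score (per the recipe_polis description). A self-contained numpy sketch of that selection rule on toy data with two clear opinion groups — the actual implementation details (initialization, distance metric) are assumptions:

```python
import numpy as np


def lloyd_kmeans(X, k, n_iter=50):
    """Plain Lloyd's k-means with deterministic, evenly spaced initialization."""
    C = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                C[j] = X[labels == j].mean(axis=0)
    return labels


def mean_silhouette(X, labels):
    """Mean silhouette score; singleton clusters score 0 by convention."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        own = labels[i]
        same = labels == own
        if same.sum() < 2:
            scores.append(0.0)
            continue
        a = D[i, same].sum() / (same.sum() - 1)   # mean dist within own cluster
        b = min(D[i, labels == c].mean() for c in np.unique(labels) if c != own)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))


# two well-separated opinion groups in a 2-D embedding
blob1 = np.array([[0.0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5], [0.2, 0.8]])
blob2 = blob1 + np.array([10.0, 0.0])
X = np.vstack([blob1, blob2])

# pick k in 2..5 with the best silhouette, as recipe_polis's kmeans step does
best_k = max(range(2, 6), key=lambda k: mean_silhouette(X, lloyd_kmeans(X, k)))
```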
Visualization
Use val.viz.embedding with color= set to the matching k-means key for each basis — it handles titling automatically.
val.viz.schematic_diagram(adata)
with val.viz.schematic_diagram(diff_from=adata):
val.tools.recipe_polis(adata)
val.viz.embedding(adata, basis="X_pca_polis", color=["kmeans_polis", "pct_seen"])
val.viz.embedding(adata, basis="X_pacmap", color=["kmeans_pacmap", "pct_seen", "pct_agree"])
val.viz.embedding(adata, basis="X_localmap", color=["kmeans_localmap", "pct_seen", "pct_agree"])
val.viz.langevitour(adata, use_reps=["X_umap", "X_pca[:10]"], color="leiden")
val.viz.jscatter(adata, ...)
CLI Exploration
When exploring from the CLI (not a notebook), save plots as PNGs and open them on the user's system.
Important: Do NOT use fig, ax = plt.subplots() with ax=ax — this is incompatible with multiple color keys. Instead, let scanpy manage figure layout, use show=False, and save via plt.savefig():
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
val.viz.embedding(adata, basis='X_pacmap', color=['kmeans_pacmap', 'pct_seen', 'pct_agree'], show=False)
plt.savefig('/tmp/polis_pacmap.png', dpi=150, bbox_inches='tight')
plt.close()
Then open it with open /tmp/polis_pacmap.png on macOS (or xdg-open on Linux).
Typical Notebook Workflow
import valency_anndata as val
adata = val.datasets.polis.load("https://pol.is/report/r29kkytnipymd3exbynkd")
val.datasets.polis.translate_statements(adata, translate_to="en")
val.viz.schematic_diagram(adata, diff_from=None)
with val.viz.schematic_diagram(diff_from=adata):
val.tools.recipe_polis(adata)
val.preprocessing.calculate_qc_metrics(adata, inplace=True)
val.viz.pca(adata, color="kmeans_polis")
val.viz.embedding(adata, basis="pacmap", color=["kmeans_pacmap", "pct_seen", "pct_agree"])
val.viz.langevitour(adata, use_reps=["X_umap", "X_pca[:10]"], color="leiden")
Common Gotchas
- .var["is_meta"] must exist before recipe_polis — ValueError otherwise.
- ipywidgets<8 is pinned for Colab compatibility. Don't bump without testing Colab.
- setuptools<81 is required because langevitour imports setuptools at runtime.
- PaCMAP crashes the notebook kernel on Python 3.10 in CI — use 3.11+ for CI.
- Private modules use an _underscore prefix. Only functions in __init__.py are public API.
- Use uv run for all commands (the project uses uv exclusively).
Development
uv sync --extra dev
uv run ruff check src/
uv run ruff format src/
make test
make test-live
make serve
make docs