| name | msgvault-query |
| description | Query msgvault email archive analytics via SQL views. Use when: querying email history, analyzing senders/domains/labels, thread analysis, attachment stats, messages per month, sender graphs, domain breakdowns, email analytics. Triggers on: msgvault, email archive, email search, email analytics, sender analysis, domain analysis. |
| triggers | ["msgvault","email archive","email search","email analytics","sender analysis","domain analysis"] |
msgvault-query
Run SQL against the msgvault email archive via msgvault query. The analytics cache is DuckDB over Parquet — queries run in milliseconds. No DuckDB binary or Parquet path knowledge required.
The analytics cache is built automatically when stale or missing.
Quick Start
msgvault query --format table "SELECT from_email, message_count FROM v_senders ORDER BY message_count DESC LIMIT 10"
msgvault query --format csv "SELECT subject, sent_at, from_email FROM v_messages WHERE from_domain = 'example.com' AND year = 2024 ORDER BY sent_at DESC"
msgvault query "SELECT name, message_count, total_size FROM v_labels ORDER BY message_count DESC"
Available Views
See references/views.md for full column schemas.
Base views (direct Parquet data)
| View | Description |
|---|
messages | Raw message metadata partitioned by year |
participants | Email addresses with domain and display name |
message_recipients | Message-to-participant links (from/to/cc/bcc) |
labels | Gmail label names |
message_labels | Message-to-label links |
attachments | Attachment metadata (filename, size) per message |
conversations | Thread grouping |
sources | Synced accounts |
Convenience views (pre-joined aggregates)
| View | Description |
|---|
v_messages | Messages with resolved sender (from_email, from_name, from_domain) and labels as JSON array |
v_senders | Per-sender aggregates: message_count, total_size, attachment stats, first/last message |
v_domains | Per-domain aggregates: message_count, total_size, sender_count |
v_labels | Per-label aggregates: message_count, total_size |
v_threads | Per-conversation aggregates: message_count, date range, participant_emails as JSON array |
Output Formats
msgvault query "..."
msgvault query --format csv "..."
msgvault query --format table "..."
Common Queries
Top senders
SELECT from_email, from_name, message_count, total_size
FROM v_senders
ORDER BY message_count DESC
LIMIT 20
Domain breakdown
SELECT domain, message_count, sender_count, total_size
FROM v_domains
ORDER BY message_count DESC
LIMIT 20
Messages from a domain in a date range
SELECT subject, sent_at, from_email, snippet
FROM v_messages
WHERE from_domain = 'example.com'
AND year BETWEEN 2022 AND 2024
ORDER BY sent_at DESC
LIMIT 50
Label distribution
SELECT name, message_count, total_size
FROM v_labels
ORDER BY message_count DESC
Thread analysis
SELECT conversation_title, message_count, first_message_at, last_message_at, participant_emails
FROM v_threads
ORDER BY message_count DESC
LIMIT 20
Large attachments
SELECT m.subject, m.from_email, m.sent_at, a.filename, a.size
FROM v_messages m
JOIN attachments a ON a.message_id = m.id
ORDER BY a.size DESC
LIMIT 20
Messages per month
SELECT year, month, COUNT(*) AS message_count
FROM messages
GROUP BY year, month
ORDER BY year, month
CLI Commands (non-SQL)
For tasks that don't need SQL aggregation, use these directly:
| Task | Command |
|---|
| Archive stats | msgvault stats |
| Full-text search | msgvault search "<query>" --json |
| Incremental sync | msgvault sync-incremental <email> |
| Full sync | msgvault sync-full <email> |
| Build analytics cache | msgvault build-cache |
| Interactive TUI | msgvault tui |
Tips
- Partition pruning: Always add
WHERE year = YYYY or year BETWEEN X AND Y when filtering by date — this skips entire Parquet partitions and cuts query time significantly.
- Labels are JSON: In
v_messages, labels is a JSON array string. Use DuckDB's json_array_contains(labels, 'INBOX') to filter by label, or join through message_labels + labels for exact matching.
- Sender resolution is dual-path:
v_messages resolves the sender via message_recipients (email messages) with a fallback to messages.sender_id (chat messages). Use v_messages instead of messages whenever you need from_email or from_name.
- All queries are read-only:
msgvault query never modifies the archive.
- Default format is JSON: Pipe through
jq for further filtering. Use --format table for human-readable output.