| name | birdclaw |
| description | X/Twitter archive search: yearly vibes, odd tweets, quality filters. |
Use this for X/Twitter archive questions before web/API lookup. Local archive first; live X only when explicitly needed for current account state.
Prefer:
- `~/Projects/birdclaw`
- `birdclaw`
- `~/.birdclaw/birdclaw.sqlite`

Check basic health/freshness before analysis:
birdclaw --json db stats
sqlite3 ~/.birdclaw/birdclaw.sqlite "pragma quick_check;"
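The health check above can be turned into a gate that runs before any analysis. This is a minimal sketch: it relies only on SQLite's documented behavior that `pragma quick_check` prints the single line `ok` when it finds no problems. The helper name `quick_check_passed` is hypothetical and not part of Birdclaw.

```shell
# Archive path as named earlier in this document.
db="$HOME/.birdclaw/birdclaw.sqlite"

# Hypothetical helper: succeeds only when quick_check reported exactly "ok".
quick_check_passed() {
  [ "$1" = "ok" ]
}

# Only run when both the archive and the sqlite3 binary are present.
if [ -f "$db" ] && command -v sqlite3 >/dev/null 2>&1; then
  result="$(sqlite3 "$db" 'pragma quick_check;')"
  if quick_check_passed "$result"; then
    echo "archive integrity ok"
  else
    echo "archive integrity problem: $result" >&2
  fi
fi
```

Any output other than `ok` is a list of problems, so it is worth reporting before analysis continues.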
Match the depth of read to the task:
- `birdclaw whois <query> --context 4 --no-xurl-fallback --json`. This searches local DMs, adds surrounding context, resolves archive numeric profiles from the persistent cache and `bird`, and avoids `xurl` unless you explicitly allow it.
- `birdclaw search dms <query> --context 4 --resolve-profiles --expand-urls --no-xurl-fallback --json` when adjacent messages, profile names, or expanded t.co links matter.
- For a single year: `--originals-only --hide-low-quality`, full year, `--limit 20000`. One year at a time.
- Across several years: `--originals-only --hide-low-quality`, `--limit 20000` per year. Expect to ingest 50k+ tweets total. Do NOT shortcut this with a top-N `like_count desc` SQL query: that yields only viral peaks and misses the everyday texture, recurring themes, and emotional tone the task needs.

Top-liked SQL slices are for spot-checking, not for vibe work. A 30-row `order by like_count desc` is the wrong tool for any task that asks for arc, narrative, or "what was X like."
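For multi-year work, the one-year-at-a-time reads can be scripted. The sketch below is illustrative only: it uses calendar-year bounds (January 1 inclusive to the following January 1 exclusive), and the helper `year_bounds`, the `FIRST_YEAR`/`LAST_YEAR` values, and the `out/` directory are assumptions, not part of Birdclaw.

```shell
# Hypothetical helper: print "SINCE UNTIL" for one calendar year.
year_bounds() {
  printf '%s-01-01 %s-01-01\n' "$1" "$(( $1 + 1 ))"
}

# Illustrative range; adjust to the years the archive actually covers.
FIRST_YEAR=2019
LAST_YEAR=2021

# Skip the loop entirely when the CLI is not installed.
if command -v birdclaw >/dev/null 2>&1; then
  mkdir -p out
  year="$FIRST_YEAR"
  while [ "$year" -le "$LAST_YEAR" ]; do
    bounds="$(year_bounds "$year")"
    since_date="${bounds% *}"   # text before the space
    until_date="${bounds#* }"   # text after the space
    birdclaw --json search tweets --since "$since_date" --until "$until_date" \
      --originals-only --hide-low-quality --limit 20000 > "out/tweets-$year.json"
    year=$(( year + 1 ))
  done
fi
```

Writing one file per year keeps each read inspectable on its own and makes gaps in a given year easy to spot.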
Prefer cached local-first commands before web/API:
birdclaw whois blacksmith --context 4 --no-xurl-fallback --json
birdclaw search dms "blacksmith" --context 4 --resolve-profiles --expand-urls --no-xurl-fallback --json
Caching model:
- Profile resolution checks `profiles`, then `sync_cache`, then `bird user`.
- `xurl` is the last fallback; pass `--no-xurl-fallback` when avoiding X API spend matters.
- URL expansion checks `sync_cache` first and mirrors results into persistent `url_expansions`; use `--refresh-url-cache` only when stale links matter.
- Use `profileEvidence` in `whois --json` to separate `affiliation`, `bio_handle`, `bio_domain`, `bio_company`, `profile_url`, `profile_bio_url`, `profile_history`, `dm_context`, and `expanded_url` matches.

How the richer identity evidence works:
- `bird profiles ... --json` is the preferred batch profile hydrator when several archive profile IDs need refreshing; `bird user --profile-only --json` is the single-profile fallback. Both can expose X GraphQL profile URL entities and highlighted-label affiliations without using the paid X API.
- Profiles are stored in `profiles`, active organization/badge edges in `profile_affiliations`, profile-change history in `profile_snapshots`, and extracted bio identity hints in `profile_bio_entities`; backups include all four shards.
- `identity_search_index` exists for fast local `whois` lookups. It is rebuilt from profile/bio/affiliation/history data and should not be treated as source-of-truth evidence.
- Affiliation organizations are resolved through `bird` on a fresh profile hydration, and the edge is rewritten to the real local organization profile id when available.
- Bio identity hints are extracted into `@handle`, domain, and company-phrase rows. This is why `whois "blacksmith guy"` can rank someone from `@useblacksmith` and `blacksmith.sh` even if the exact phrase was not in the DM text.
- `whois` can surface old matching values as `profile_history`.
- `whois` scores profile bio/name/handle matches, profile URL and bio URL matches, affiliation matches, bio entity matches, profile-history matches, DM context, and expanded t.co URLs separately. It ranks current affiliation and bio identity evidence above plain domains, distinguishes ecosystem labels such as "GitHub Star" from staff/company matches, and buckets human output into likely affiliated, ecosystem, profile/link, DM-context, and other matches.
- Use `--current-affiliation <org>` for strict active badge matches, `--affiliation <org>` for active/bio/history affiliation evidence, and `--exclude-domain-only` when a query like "GitHub people" should ignore accounts that only have github.com links.
- Profiles are read from local/`sync_cache` and URL expansions from cache; use refresh flags only when current profile/bio/link evidence matters.

Use `--expand-urls` when t.co links are evidence. It may touch the network on cache miss, but it is not an X API call.
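The choice between the strict and broad affiliation flags can be made explicit in a small wrapper. This is a sketch under assumptions: the helper `whois_affiliation_flags` is hypothetical, and combining these flags with a free-text query in one `whois` call is assumed rather than confirmed here. Only the flag names come from the notes above.

```shell
# Hypothetical helper: choose whois flags by how strict the match should be.
#   strict -> active badge matches only
#   broad  -> active/bio/history evidence, ignoring domain-only matches
whois_affiliation_flags() {
  mode="$1"
  org="$2"
  case "$mode" in
    strict) printf '%s\n' "--current-affiliation $org" ;;
    broad)  printf '%s\n' "--affiliation $org --exclude-domain-only" ;;
    *)      return 1 ;;
  esac
}

# Example call, run only when the CLI is installed.
# The unquoted substitution is deliberate so the flags split into separate
# arguments; it assumes the organization name contains no spaces.
if command -v birdclaw >/dev/null 2>&1; then
  birdclaw whois "GitHub people" $(whois_affiliation_flags broad GitHub) \
    --no-xurl-fallback --json
fi
```

Starting with `strict` and widening to `broad` only when it returns too little keeps ecosystem and link-only accounts out of the first pass.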
Use the persistent link index when looking for remembered shared tweets, videos, or t.co expansions:
birdclaw links backfill
birdclaw --json search links "the work" --source dm --media video --limit 50
Notes:
- `links backfill` indexes tweet/DM URL occurrences and expands missing/error/miss t.co rows; use `--refresh-url-cache` to force re-expansion.
- The default scope is t.co; add `--all-urls` only when non-shortened links matter.
- `search links` matches short URLs, expanded URLs, linked tweet text/author, and source tweet/DM text.
- The index is stored in `url_expansions` + `link_occurrences`; both are included in Git-friendly backups under `data/links/`.

For annual summaries, compare raw counts against summary-quality originals:
birdclaw --json search tweets --since 2020-01-01 --until 2021-01-01 --limit 20000
birdclaw --json search tweets --since 2020-01-01 --until 2021-01-01 --originals-only --hide-low-quality --limit 20000
Use exact date bounds: YYYY-01-01 inclusive to next-year YYYY-01-01 exclusive. Report counts and note archive gaps if stats show them.
When summarizing vibe:
--originals-only is separate from quality. It excludes authored replies using the current Birdclaw query contract.
--hide-low-quality maps to qualityFilter: summary. It hides common noise while preserving meaningful short posts:
- `https://t.co/` URLs
- t.co links and no media

It should preserve:
For full-year summary work, default to exact bounds:
birdclaw --json search tweets --since 2020-01-01 --until 2021-01-01 --originals-only --hide-low-quality --limit 20000
In the current implementation, "low-like" means like_count < 50.
Before changing thresholds, inspect real included and excluded examples.
Recommended checks:
- `--min-likes`, media flags, or debug reason output only when the use case needs it

Useful SQL sketch for rule tuning:
sqlite3 ~/.birdclaw/birdclaw.sqlite "
select id, created_at, like_count, text
from tweets
where created_at >= '2020-01-01' and created_at < '2021-01-01'
order by random()
limit 50;"
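A variant of the random sample above can label each row by which side of the like threshold it falls on, which helps when inspecting examples before changing that threshold. This is a minimal sketch: it uses only the `tweets` columns from the query above and the stated `like_count < 50` rule. The helper `is_low_like` and the bucket labels are hypothetical, and the labels reflect like count only, not the full summary filter.

```shell
# Mirrors the "low-like" definition stated above (like_count < 50).
LOW_LIKE_THRESHOLD=50

# Hypothetical helper: succeeds when a like count falls below the threshold.
is_low_like() {
  [ "$1" -lt "$LOW_LIKE_THRESHOLD" ]
}

db="$HOME/.birdclaw/birdclaw.sqlite"

# Only query when both the archive and the sqlite3 binary are present.
if [ -f "$db" ] && command -v sqlite3 >/dev/null 2>&1; then
  sqlite3 "$db" "
    select case when like_count < $LOW_LIKE_THRESHOLD
                then 'low-like' else 'at-or-above' end as bucket,
           id, created_at, like_count, text
    from tweets
    where created_at >= '2020-01-01' and created_at < '2021-01-01'
    order by random()
    limit 50;"
fi
```

Reading both buckets side by side shows whether short but meaningful posts sit just under the threshold.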
Use backup sync when asked to preserve or restore the local archive via GitHub:
birdclaw --json backup sync --repo ~/Projects/backup-birdclaw --remote https://github.com/steipete/backup-birdclaw.git
Included source-of-truth shards: accounts, profiles, profile affiliations/snapshots/bio entities, tweets, tweet collections, timeline edges, DMs, blocks, mutes, AI scores, tweet actions, and link index rows.
Intentionally not backed up: sync_cache, identity_search_index, FTS tables/shadow tables, local SQLite files, and config.json. URL expansion cache rows are persisted into the backed-up url_expansions shard.
After query/filter changes, run focused tests first:
pnpm test src/lib/queries.test.ts src/cli.test.ts src/routes/api/query.test.ts
After link-index or backup changes:
pnpm test src/lib/url-expansion.test.ts src/lib/link-index.test.ts src/lib/backup.test.ts
Then run the broader release-relevant gate:
pnpm run check
pnpm test
pnpm build
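The order above (focused tests first, then the broader gate) can be chained so the run stops at the first failure. This is a sketch only: the helper `run_steps` is hypothetical, and the guard assumes the script is started from the repository root, where `package.json` lives.

```shell
# Hypothetical helper: run each command line in order, stop at the first failure.
run_steps() {
  for step in "$@"; do
    echo "==> $step"
    sh -c "$step" || { echo "failed: $step" >&2; return 1; }
  done
}

# Only run inside a checkout where pnpm is available.
if command -v pnpm >/dev/null 2>&1 && [ -f package.json ]; then
  run_steps \
    "pnpm test src/lib/queries.test.ts src/cli.test.ts src/routes/api/query.test.ts" \
    "pnpm run check" \
    "pnpm test" \
    "pnpm build"
fi
```

Swap the first step for the link-index and backup test files when those areas changed instead.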
Smoke the CLI with a real year query:
pnpm --silent cli --json search tweets --since 2020-01-01 --until 2021-01-01 --originals-only --hide-low-quality --limit 20000