تشغيل أي مهارة في Manus بنقرة واحدة

sql-ml-features

النجوم٥٦

التفرعات٣

آخر تحديث١٣ يونيو ٢٠٢٦ في ٢٠:٤٨

Use when preparing data for machine learning from SQL Server — feature engineering in T-SQL, building training/test datasets, statistical aggregations for ML pipelines, sampling strategies, data normalization and encoding in SQL, writing queries that feed pandas or scikit-learn, exporting to Parquet or CSV for model training, or when a data scientist asks for a 'feature table' or 'training set' from a SQL Server database.

التثبيت

التثبيت باستخدام Codex أو Claude انسخ هذا Prompt والصقه في Codex أو Claude أو مساعد آخر ليراجع صفحة Skill ويثبّتها لك.

تشغيل في Manus

المصدر

damusix

damusix/skills

فتح مستودع GitHub عرض مستودعات المنشئ

تنزيل

تشغيل في Manus

المهن ذات الصلةSOC

استنادا إلى تصنيف SOC المهني

علماء البياناتمهن الحاسوب والرياضيات·SOC 15-2051

مستكشف الملفات

7 ملفات

SKILL.md

readonly

المزيد من هذا المستودع

نفس المستودع

cli-building

damusix/skills

Use when building TypeScript CLIs. Guides command structure, interactive prompts, and tab completion using citty, `@clack/prompts`, and `@bomb.sh/tab` — and shipping them: CI gates, versioning (Changesets / release-please), and npm or single-binary release.

2026-06-1656

arm-bandits-expert

damusix/skills

Implements, evaluates, and deploys multi-armed bandit algorithms — including Thompson Sampling, UCB, epsilon-greedy, LinUCB, EXP3, and contextual bandits. Covers algorithm selection, experiment harnesses, offline evaluation (IPS, Doubly Robust), infrastructure patterns, and correctness verification. Use when the user asks about multi-armed bandits, exploration-exploitation tradeoffs, adaptive experiments, A/B testing alternatives, online optimization, bandit-based recommendation or personalization systems, or contextual bandits.

2026-06-1356

ast-grep

damusix/skills

AST-based code search, lint, and rewrite using ast-grep. Use when finding code patterns structurally (not textually), writing lint rules, building codemods, or migrating API usage across a codebase. Prefer over regex grep when the match target is a syntactic construct (function call, import, class field, assignment).

2026-06-1356

ghost-theme

damusix/skills

Build, customize, and deploy Ghost CMS themes. Use whenever the user mentions Ghost themes, Ghost CMS, Handlebars templates (.hbs), Ghost Admin, membership/subscription integration, custom settings, or the content API — even without saying "theme". Trigger on: building a blog theme, creating a Ghost site, editing .hbs templates, member-only content, hero sections, routing (routes.yaml), image optimization, dark mode, search, deploy, gscan validation, JSON-LD/SEO, or any mention of {{ghost_head}}, {{ghost_foot}}, {{#foreach}}, {{#get}}, {{img_url}}, {{asset}}, @custom, @member, or Portal. Also use to modify, extend, or debug an existing theme. Also trigger on Koenig editor cards and .kg-* card styling (kg-image-card, kg-gallery-card, kg-bookmark-card, kg-callout-card, kg-toggle-card, kg-header-card, kg-signup-card), card_assets, cards.min.css, and breakout/wide/full-width card layout.

2026-06-1356

google-zx-scripting

damusix/skills

Writes and executes JavaScript-based shell scripts using Google's zx library. Use when writing shell scripts, automation, build tools, file processing, CLI tools, deployment scripts, data pipelines, or batch operations. Also covers piping, streams, parallel execution, retries, cross-platform scripting, built-in fs utilities, and minimist argument parsing.

2026-06-1356

hapi

damusix/skills

Creates, configures, and debugs `@hapi/hapi` servers — implements routes, plugins, auth schemes, validation, caching, and request lifecycle hooks. Use when building HTTP APIs, setting up stale-while-revalidate caching, registering server methods, configuring views, managing startup sequences, or troubleshooting response marshalling.

2026-06-1356

name

sql-ml-features

description

SQL for Machine Learning

Patterns for extracting ML-ready features from SQL Server — turning normalized relational data into the wide, denormalized, NULL-free datasets that models consume.

When to Use

Building a feature table or training dataset from application data
Feature engineering in T-SQL (rolling aggregations, lag features, RFM scoring)
Encoding categorical variables (one-hot, ordinal, frequency encoding)
Handling NULLs for ML (imputation strategies in SQL)
Sampling and train/test splitting from SQL Server
Exporting large datasets to CSV, Parquet, or pandas
Preventing data leakage in temporal feature queries
Writing queries that feed scikit-learn, XGBoost, or PyTorch pipelines

When NOT to use: application schema design (table design, naming conventions, access control), query performance tuning (execution plans, index tuning, wait stats), BI dashboards and summary reports (GROUPING SETS, pivot tables, dashboard queries), or running R/Python inside SQL Server (SQL Server ML Services).

Feature Table Build Workflow

Follow these steps in order. Each step has a validation checkpoint.

Set snapshot date — Anchor all features to a fixed @SnapshotDate. Never use GETDATE() inside feature queries.
- Validate: SELECT @SnapshotDate returns the intended date.
Build base entity table — SELECT DISTINCT the entity key into a temp table #Base. One row per entity, primary key only — no features yet.
```
SELECT DISTINCT CustomerId
INTO #Base
FROM Customers
WHERE SignupDate < @SnapshotDate;
```
- Validate: SELECT COUNT(*), COUNT(DISTINCT CustomerId) FROM #Base — counts must match (no duplicates).
Join features with temporal bounds — LEFT JOIN each feature set to #Base. Bound every join with AND EventDate <= @SnapshotDate.
```
SELECT b.CustomerId,
       DATEDIFF(DAY, f.LastOrder, @SnapshotDate) AS Recency,
       f.OrderCount AS Frequency
FROM #Base b
LEFT JOIN FeatureCTE f ON f.CustomerId = b.CustomerId;
```
- Validate: Row count equals #Base row count. No feature references dates after @SnapshotDate.
Impute NULLs — Replace every NULL with an explicit value. Use COALESCE with mean, median, zero, or a sentinel depending on the column semantics.
- Validate: SELECT SUM(CASE WHEN col IS NULL THEN 1 ELSE 0 END) FROM FeatureTable returns 0 for every column.
Encode categoricals — One-hot, ordinal, or frequency encode all non-numeric columns.
- Validate: SELECT * FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'FeatureTable' — all columns are numeric types (int, float, decimal, bit).
Split train/test — Use hash-based split for cross-sectional data or cutoff-date split for time series.
- Validate: SELECT SplitLabel, COUNT(*) FROM FeatureTable GROUP BY SplitLabel — verify expected proportions (e.g., 80/20).
Export — BCP, pandas.read_sql, or FOR JSON depending on the consumer.

Feature Engineering Taxonomy

Category	Examples	T-SQL tools
Numeric	Raw values, `LOG(col + 1)` for skewed, ratios, z-scores	`LOG(col + 1)`, `SQRT`, window `AVG`/`STDEV`
Categorical	One-hot encoding, ordinal, frequency	`CASE`, `PIVOT`, `DENSE_RANK`, COUNT ratios
Temporal	Recency, duration, day-of-week	`DATEDIFF`, `DATEPART`, `DATENAME`
Rolling window	7-day sum, 30-day average	`SUM/AVG OVER (ROWS BETWEEN ...)`
Lag / offset	Previous value, delta from prior	`LAG`, `LEAD`
Interaction	Product, ratio of two features	Computed expressions in SELECT
Text signals	Length, keyword presence	`LEN`, `CHARINDEX`, `PATINDEX`
Missingness	Is this value missing?	`CASE WHEN col IS NULL THEN 1 ELSE 0 END`

Quick Reference: SQL Pattern → ML Concept

ML concept	T-SQL pattern
Recency feature	`DATEDIFF(DAY, LastEventDate, @SnapshotDate)`
Frequency feature	`COUNT(*) OVER (PARTITION BY entity ORDER BY dt ROWS BETWEEN 29 PRECEDING AND CURRENT ROW)`
Rolling 7-day revenue	`SUM(Amount) OVER (ORDER BY OrderDate ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)`
Lag feature (t-1)	`LAG(Amount, 1) OVER (PARTITION BY CustomerId ORDER BY OrderDate)`
One-hot encode	`CASE WHEN Category = 'A' THEN 1 ELSE 0 END AS Category_A`
Ordinal encode	`DENSE_RANK() OVER (ORDER BY Category)`
Frequency encode	`COUNT() OVER (PARTITION BY Category) 1.0 / COUNT(*) OVER ()`
Log transform	`LOG(Amount + 1)` — the `+1` offset handles zero values; omit only when zeros are impossible
Quantile bucket	`NTILE(10) OVER (ORDER BY Score)`
Row hash for split	5-step chain: (1) `CAST(Id AS NVARCHAR(20))` → (2) `HASHBYTES('SHA2_256', ...)` → (3) `CAST(... AS BINARY(8))` → (4) `CAST(... AS BIGINT)` → (5) `ABS(...) % 10`
Mean imputation	`COALESCE(col, AVG(col) OVER ())`
Median imputation	`PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY col) OVER ()`
NULL indicator	`CASE WHEN col IS NULL THEN 1 ELSE 0 END AS col_is_missing`
Min-max scaling	`(col - MIN(col) OVER ()) / NULLIF(MAX(col) OVER () - MIN(col) OVER (), 0)`
Date bucketing	`DATE_BUCKET(WEEK, 1, EventDate)` (SQL Server 2022+)
Date truncation	`DATETRUNC(MONTH, EventDate)` (SQL Server 2022+)
First value in series	`FIRST_VALUE(col) OVER (PARTITION BY entity ORDER BY dt ROWS UNBOUNDED PRECEDING)`
Rolling volatility	`STDEV(col) OVER (PARTITION BY entity ORDER BY dt ROWS BETWEEN 29 PRECEDING AND CURRENT ROW)`

Common Mistakes

Mistake	What goes wrong	Fix
Using `AVG(col)` without understanding NULL exclusion	AVG divides by non-NULL count, not row count — mean is inflated when NULLs represent zeros	Use `AVG(COALESCE(col, 0))` when NULLs mean zero
Rolling window with `RANGE` instead of `ROWS`	Tied timestamps include extra rows, producing different counts per row	Always use `ROWS BETWEEN n PRECEDING AND CURRENT ROW`
Joining to future events in feature query	Leaks information that wasn't available at prediction time	Bound all joins with `AND EventDate <= @SnapshotDate`
Computing global statistics before the train/test split	Mean and variance incorporate test data into training features	Compute statistics only over training rows
Using `LAG` without checking temporal order	LAG requires `ORDER BY` — missing it gives non-deterministic offsets	Always include `PARTITION BY entity ORDER BY timestamp`
Forward-filling sensor NULLs across entity boundaries	PARTITION BY is missing, so last value bleeds across entities	Always `PARTITION BY entity_id` in forward-fill window
HASHBYTES on `NULL` key	`HASHBYTES(algo, NULL)` returns NULL — entity lands in no split	Exclude or handle NULL keys before splitting
UNPIVOT for categorical to indicator columns	`UNPIVOT` silently drops NULL values	Use `CROSS APPLY VALUES` to preserve NULLs as zeros
Random split for time-series data	Test set leaks future patterns into training window	Use a cutoff date for time-series train/test split
Including the label column in lag features	`LAG(label)` leaks ground truth about adjacent rows	Exclude any derivative of the target from feature set

Reference Files

Feature Engineering — numeric, categorical, temporal, rolling windows, lag, RFM, text
Sampling and Splitting — TABLESAMPLE, hash-based splits, NTILE k-fold, stratified, time-based
NULL Imputation — mean/median/mode/forward-fill/constant, missingness indicators
Export Patterns — BCP, pandas.read_sql, chunked reads, FOR JSON, Parquet, feature stores
Data Leakage — temporal, target, train/test leakage; prevention patterns in SQL