| name | dct-profile |
| description | Use this skill when the user wants to analyze data quality, profile data files, check value distributions, perform character analysis on text fields, identify data quality issues, or get statistics about dataset contents. Triggers include "profile this data", "analyze data quality", "check for nulls", "value distribution", "character frequency", "data statistics", "column profiling", or when doing exploratory data analysis or quality assessment. |
DCT Profile - Data Quality Analysis
Analyze data files for value distributions, unique counts, and character frequencies.
When to Use
Use this skill when you need to:
- Assess data quality before processing
- Identify anomalies or outliers
- Check for null/missing values
- Analyze text field character distributions
- Understand value cardinality
- Validate data format compliance
Installation
which dct || { go build -o dct && chmod +x ./dct; }
Usage
dct prof <file> [flags]
Arguments
file: Data file to profile (CSV, JSON, NDJSON, or Parquet)
Flags
-o, --output <file>: Output to file instead of stdout
Examples
Profile a CSV file:
dct prof data.csv
Profile a Parquet file:
dct prof large.parquet
Save profile report:
dct prof messy.csv -o data_quality_report.txt
Profile JSON data:
dct prof data.json
Output Sections
The profile report includes detailed analysis for each column:
1. Count Statistics
Basic cardinality information:
-- Field: `email` --
Count: 1000
Unique Count: 995
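To make the two numbers concrete, here is a minimal Python sketch of how a count and a unique count could be derived for one column. It is an illustration only, not dct's implementation; the `count_stats` helper is hypothetical, and skipping missing values is an assumption about how the tool counts.

```python
# Illustrative only: mirrors the "Count" / "Unique Count" lines, not dct's code.
def count_stats(values):
    """Return (count, unique_count) for one column."""
    # Assumption: missing values (None) are excluded from both numbers.
    present = [v for v in values if v is not None]
    return len(present), len(set(present))


emails = ["a@example.com", "b@example.com", "a@example.com", None]
count, unique = count_stats(emails)
print(f"Count: {count}")
print(f"Unique Count: {unique}")
```

A unique count close to the count, as in the `email` sample above, suggests a near-identifier field.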
2. Value Occurrences
Most common values with their frequencies:
Value Occurrence
row: value -> count
0: user@example.com -> 1
1: admin@example.com -> 1
...
MOSTLY UNIQUE VALUES SHOWING SAMPLE...
For high-cardinality fields, shows a sample of unique values.
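The occurrence list is essentially a frequency table. A rough Python equivalent, assuming a hypothetical `value_occurrences` helper and an arbitrary cutoff of ten rows, might look like this:

```python
from collections import Counter


def value_occurrences(values, top=10):
    """Return the `top` most frequent values as (value, count) pairs."""
    # `top` is an illustrative cutoff, not a documented dct setting.
    return Counter(values).most_common(top)


statuses = ["active", "inactive", "active", "pending", "active"]
for row, (value, n) in enumerate(value_occurrences(statuses)):
    print(f"{row}: {value} -> {n}")
```

For a low-cardinality field such as this one, a handful of rows covers the whole column; for a field like `email`, almost every row would have a count of 1.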
3. String Length Statistics
For text fields, provides length metrics:
Value Summary - String Lengths
Min: 10
Mean: 22.500000
Max: 45
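The length summary can be reproduced by hand when you want to cross-check a suspicious column. The sketch below is an assumption-laden illustration: `length_stats` is a hypothetical helper, and it measures length in Unicode code points, which may differ from what dct measures (for example, bytes).

```python
def length_stats(values):
    """Return (min, mean, max) string length, or None if there are no strings."""
    # Assumption: length is counted in code points; non-string values are skipped.
    lengths = [len(v) for v in values if isinstance(v, str)]
    if not lengths:
        return None
    return min(lengths), sum(lengths) / len(lengths), max(lengths)


stats = length_stats(["ab", "abcd", "abcdef"])
if stats is not None:
    shortest, mean, longest = stats
    print(f"Min: {shortest}")
    print(f"Mean: {mean:.6f}")
    print(f"Max: {longest}")
```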
4. Character Frequency Analysis
Detailed character-level statistics:
Char Occurrence
row: rune -> count
00: '@' (hex: U+0040) (dec: 64) -> 1000
01: '.' (hex: U+002E) (dec: 46) -> 1000
02: 'e' (hex: U+0065) (dec: 101) -> 2500
Shows:
- Character symbol
- Hexadecimal code (U+XXXX)
- Decimal code
- Total occurrences
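As a rough illustration of what the character table contains, the following Python sketch tallies characters across a column and prints each one with its code point in the same layout as the sample above. The helper names are hypothetical and the row format is inferred from that sample, not taken from dct's source.

```python
from collections import Counter


def char_occurrences(values):
    """Tally every character across all string values in a column."""
    counts = Counter()
    for v in values:
        if isinstance(v, str):
            counts.update(v)  # updating with a string counts each character
    return counts


def format_row(row, ch, n):
    """Render one row in the layout shown in the sample report."""
    return f"{row:02d}: {ch!r} (hex: U+{ord(ch):04X}) (dec: {ord(ch)}) -> {n}"


counts = char_occurrences(["a@b.io", "c@d.io"])
for row, (ch, n) in enumerate(counts.most_common(3)):
    print(format_row(row, ch, n))
```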
Data Quality Indicators
Look for these patterns in the output:
Missing/Null Values
- Low count vs expected row count
- <nil> values in occurrence list
Duplicates
- Count significantly higher than unique count
- Same value appearing multiple times
Encoding Issues
- Unexpected characters in char occurrence
- Non-ASCII characters (hex > U+007F)
- Null bytes (U+0000)
Format Inconsistencies
- Wide range in string lengths
- Mixed formats in same column
- Special characters in unexpected places
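If you want to confirm an encoding concern outside of dct, a small independent check can help. The sketch below uses a hypothetical `suspicious_chars` helper to count null bytes and non-ASCII characters in a list of strings; the two categories are chosen to match the indicators listed above and are not a dct feature.

```python
from collections import Counter


def suspicious_chars(values):
    """Count null bytes and non-ASCII characters across string values."""
    flagged = Counter()
    for v in values:
        if not isinstance(v, str):
            continue
        for ch in v:
            code_point = ord(ch)
            if code_point == 0:
                flagged["null byte"] += 1
            elif code_point > 0x7F:
                flagged["non-ASCII"] += 1
    return dict(flagged)


print(suspicious_chars(["caf\u00e9", "a\x00b", "plain"]))
```

Non-ASCII characters are not errors in themselves; they only matter when the field is expected to be plain ASCII.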
Best Practices
- Profile first: Always profile new data sources before processing
- Check all columns: Review each field's statistics
- Look for outliers: Extreme min/max values may indicate errors
- Character analysis: Check for encoding issues, especially in text fields
- Save reports: Use -o to save profiles for documentation
Example Workflow
dct prof incoming_data.csv -o profile.txt
dct prof cleaned_data.csv
Interpreting Results
Good Data Quality Signs
- Count matches expected row count
- Unique counts appropriate for field type
- Character distributions match expected language/encoding
- String lengths within reasonable bounds
Warning Signs
- High null counts
- Extreme string length variations
- Unexpected special characters
- Count/unique count ratio indicates duplicates
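The count/unique ratio can be checked mechanically on a saved report. The sketch below assumes the report uses exactly the `-- Field: ... --`, `Count:` and `Unique Count:` lines shown in the Count Statistics sample; the real layout may differ, and `duplicate_ratios` is a hypothetical helper, not part of dct.

```python
import re

# Assumption: field headers look like "-- Field: `email` --" as in the sample.
FIELD_RE = re.compile(r"^-- Field: `(.+)` --$")


def duplicate_ratios(report_text):
    """Map each field name to the fraction of its values that are repeats."""
    ratios = {}
    field = None
    count = None
    for line in report_text.splitlines():
        line = line.strip()
        match = FIELD_RE.match(line)
        if match:
            field, count = match.group(1), None
        elif line.startswith("Count:"):
            count = int(line.split(":", 1)[1])
        elif line.startswith("Unique Count:") and field and count:
            unique = int(line.split(":", 1)[1])
            ratios[field] = 1 - unique / count
    return ratios


sample = "-- Field: `email` --\nCount: 1000\nUnique Count: 995\n"
for name, ratio in duplicate_ratios(sample).items():
    print(f"{name}: {ratio:.1%} repeated values")
```

What counts as too many repeats depends on the field: a high ratio is expected for a status column and alarming for a primary key.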
Related Skills
dct-peek: Quick preview before detailed profiling
dct-infer: Generate schema after quality check
dct-diff: Compare profiles of two file versions
Performance Notes
- Profiles entire file by default
- May be slow on very large files (>1GB)
- Consider sampling large files with dct peek first
- Character analysis can be memory-intensive on wide text columns