---
name: data-ai-guide
description: Comprehensive data science, machine learning, and AI guide covering Python, deep learning, NLP, LLMs, prompt engineering, and MLOps. Use when building AI models, data pipelines, or machine learning systems.
---
# Data Science & AI Guide

Master data science, machine learning, generative AI, and modern AI engineering practices.
## Quick Start

### Python Data Science Stack
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Load data and separate features from the target column
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1)
y = df['target']

# Hold out 20% of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```
### Deep Learning with PyTorch
```python
import torch
import torch.nn as nn

class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        # 784 inputs (a flattened 28x28 image) -> 128 hidden units -> 10 classes
        self.linear1 = nn.Linear(784, 128)
        self.linear2 = nn.Linear(128, 10)

    def forward(self, x):
        x = torch.relu(self.linear1(x))
        return self.linear2(x)  # raw logits; CrossEntropyLoss applies softmax

model = SimpleNN()
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

# One training step on a dummy batch
inputs = torch.randn(32, 784)
labels = torch.randint(0, 10, (32,))
optimizer.zero_grad()
loss = criterion(model(inputs), labels)
loss.backward()
optimizer.step()
```
### LLM Prompt Engineering
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning?"}
    ],
    temperature=0.7
)
print(response.choices[0].message.content)
```
## Data Science Path

### Fundamentals
- Mathematics: Statistics, linear algebra, calculus
- Python: Libraries (Pandas, NumPy, Scikit-learn)
- Data Analysis: Exploratory analysis, visualization
- SQL: Querying and data manipulation
### Machine Learning
- Supervised Learning: Regression, classification
- Unsupervised Learning: Clustering, dimensionality reduction
- Model Evaluation: Cross-validation, metrics
- Hyperparameter Tuning: Grid search, Bayesian optimization
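The evaluation and tuning steps above can be sketched with scikit-learn; the dataset here is synthetic and the parameter grid is illustrative, not a recommended search space:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic classification data (stand-in for a real dataset)
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# 5-fold cross-validation gives a more stable estimate than a single split
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5)

# Exhaustive grid search over an illustrative hyperparameter grid
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
    cv=5,
)
grid.fit(X, y)
print(scores.mean(), grid.best_params_)
```

For larger search spaces, Bayesian optimization (e.g. via Optuna) usually finds good settings with far fewer trials than a full grid.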
### Deep Learning
- Neural Networks: Architecture, training
- CNNs: Computer vision tasks
- RNNs: Sequence modeling
- Transformers: Modern architecture for NLP/Vision
### Natural Language Processing
- Text Processing: Tokenization, embeddings
- Word Embeddings: Word2Vec, GloVe, FastText
- BERT: Contextual embeddings
- Transformers: GPT, BERT for various NLP tasks
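The core idea behind word embeddings is that meaning maps to direction in vector space. A minimal sketch with made-up 3-dimensional vectors (real Word2Vec/GloVe embeddings are trained and have hundreds of dimensions):

```python
import numpy as np

# Toy word vectors (illustrative; real embeddings are learned from corpora)
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.7, 0.2]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means identical direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

royal = cosine_similarity(vectors["king"], vectors["queen"])
fruit = cosine_similarity(vectors["king"], vectors["apple"])
# Semantically related words point in similar directions, so royal > fruit
```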
## Generative AI & LLMs

### Large Language Models
- GPT Family: GPT-3.5, GPT-4 for text generation
- Claude: Constitutional AI models
- Open Source: Llama, Mistral, Zephyr
- Fine-tuning: Adapting models for specific tasks
### Prompt Engineering
- Role-based Prompting: Setting context and expertise
- Few-shot Learning: Examples in prompt
- Chain-of-Thought: Step-by-step reasoning
- Retrieval Augmented Generation (RAG): Knowledge augmentation
```python
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# Embed documents into a vector store, then answer questions over it
embeddings = OpenAIEmbeddings()
vectorstore = Chroma(embedding_function=embeddings)

llm = OpenAI()
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # "stuff" packs retrieved chunks into a single prompt
    retriever=vectorstore.as_retriever()
)
```
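The role-based, few-shot, and chain-of-thought patterns listed above all come down to structured prompt text. A sketch of assembling such a prompt (the model call is omitted; the example content is made up):

```python
def build_few_shot_prompt(task, examples, query):
    """Assemble a few-shot prompt: task description, worked examples, then the query."""
    lines = [task, ""]
    for question, answer in examples:
        lines.append(f"Q: {question}")
        lines.append(f"A: Let's think step by step. {answer}")  # chain-of-thought cue
        lines.append("")
    lines.append(f"Q: {query}")
    lines.append("A: Let's think step by step.")  # prime the model to reason aloud
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    [("Great film, loved it!", "Positive."), ("Waste of time.", "Negative.")],
    "The plot dragged but the acting was superb.",
)
```

The resulting string would be sent as the user message in a chat completion request.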
### AI Agents
- Tool Use: Agents calling external tools
- Planning: Multi-step task execution
- Memory: Conversation history, context
- Evaluation: Assessing agent performance
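A minimal sketch of the tool-use loop, with a hard-coded planner standing in for the LLM (the tool names, routing logic, and task are made up for illustration):

```python
# Registry of tools the agent can call
TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
    "lookup": lambda key: {"capital_of_france": "Paris"}.get(key, "unknown"),
}

def plan(task):
    # Stand-in for the LLM planner: map a task to (tool, argument) steps
    if task == "what is 2 + 3?":
        return [("calculator", "2 + 3")]
    return [("lookup", task)]

def run_agent(task):
    memory = []  # tool-call history the agent can consult on later steps
    for tool_name, arg in plan(task):
        result = TOOLS[tool_name](arg)
        memory.append((tool_name, arg, result))
    return memory[-1][2], memory

answer, history = run_agent("what is 2 + 3?")
```

In a real agent, the planner is itself an LLM call that chooses the next tool from the history so far, and evaluation means checking final answers and tool traces against expected outcomes.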
## Data Engineering

### ETL Pipelines
- Apache Airflow: Workflow orchestration
- dbt: Data transformation
- Kafka: Stream processing
- Spark: Distributed processing
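At their core, the orchestration tools above schedule extract-transform-load steps. A minimal in-memory sketch of one such step in pandas (the source data, column names, and "warehouse" dict are made up stand-ins):

```python
import pandas as pd

def extract():
    # Stand-in for reading from a source system (file, API, database)
    return pd.DataFrame({"user": ["a", "b", "a"], "amount": [10, 20, 5]})

def transform(df):
    # Aggregate raw events into per-user totals
    return df.groupby("user", as_index=False)["amount"].sum()

def load(df, store):
    # Stand-in for writing to a warehouse table
    store["user_totals"] = df

warehouse = {}
load(transform(extract()), warehouse)
```

In Airflow, each of these functions would become a task, with the DAG declaring that extract runs before transform, and transform before load.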
### Big Data
- Hadoop: Distributed storage and processing
- Spark: In-memory processing framework
- Scala: Spark's native language
- Distributed Systems: Understanding CAP theorem
### Data Warehousing
- Snowflake: Cloud data warehouse
- BigQuery: Google's data warehouse
- Redshift: AWS data warehouse
- Star Schema: Dimensional modeling
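A star schema pairs a central fact table (measures plus foreign keys) with descriptive dimension tables. A toy illustration in pandas (the tables, keys, and values are made up):

```python
import pandas as pd

# Dimension tables: descriptive attributes keyed by surrogate IDs
dim_product = pd.DataFrame({"product_id": [1, 2], "product": ["widget", "gadget"]})
dim_date = pd.DataFrame({"date_id": [1, 2], "month": ["Jan", "Feb"]})

# Fact table: one row per sale, with measures and foreign keys
fact_sales = pd.DataFrame({
    "product_id": [1, 1, 2],
    "date_id": [1, 2, 2],
    "revenue": [100, 150, 200],
})

# Analytical query: join facts to dimensions, then aggregate
report = (
    fact_sales
    .merge(dim_product, on="product_id")
    .merge(dim_date, on="date_id")
    .groupby(["product", "month"], as_index=False)["revenue"].sum()
)
```

In a warehouse like Snowflake or BigQuery, the same shape is expressed as SQL joins from the fact table out to each dimension.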
## MLOps

### Model Management
- Model Versioning: Tracking model versions
- Model Registry: MLflow, Weights & Biases
- Experiment Tracking: Monitoring training runs
- Model Cards: Documenting model capabilities
### Deployment
- Model Serving: FastAPI, TensorFlow Serving
- Containerization: Docker for models
- Kubernetes: Production ML deployment
- API Monitoring: Performance and data drift
### Monitoring
- Data Drift: Detecting distribution changes
- Model Drift: Performance degradation
- Feature Store: Consistent feature serving
- Observability: Logging and metrics
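Data drift checks compare a feature's live distribution against its training distribution. One common metric is the population stability index (PSI), sketched here in NumPy; the 0.2 alert threshold is a widely used rule of thumb, not a formal standard, and the samples are synthetic:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population stability index between two samples of one feature."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor empty buckets to avoid log(0)
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train = rng.normal(0, 1, 5000)     # training-time distribution
same = rng.normal(0, 1, 5000)      # live data, no drift
shifted = rng.normal(1, 1, 5000)   # live data, mean shifted by one std dev
```

A PSI below roughly 0.1 is usually treated as stable, while values above 0.2 typically trigger investigation or retraining.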
## Technology Stack

### Core Libraries
- Pandas: Data manipulation
- NumPy: Numerical computing
- Scikit-learn: Machine learning
- Matplotlib/Seaborn: Visualization
- Plotly: Interactive plots
### Deep Learning
- TensorFlow: Keras API, distributed training
- PyTorch: Dynamic graphs, research-friendly
- JAX: Functional programming for ML
### LLM Frameworks
- LangChain: Building LLM applications
- LlamaIndex: RAG and indexing
- OpenAI API: GPT models access
- Hugging Face: Model hub and transformers
## Learning Path
1. Fundamentals (3 months)
   - Python programming
   - Statistics and mathematics
   - Data manipulation with Pandas
2. Machine Learning (3 months)
   - Supervised learning
   - Model evaluation
   - Feature engineering
3. Deep Learning (2 months)
   - Neural networks
   - CNNs and RNNs
   - Transformers
4. Specialization (ongoing)
   - NLP / Computer Vision / Tabular Data
   - LLMs and generative AI
   - MLOps and production
## Projects
- Iris Classification - Classic ML project
- Housing Price Prediction - Regression
- Sentiment Analysis - NLP with transformers
- Image Classification - CNN with deep learning
- LLM Chatbot - Using prompt engineering
- RAG System - Knowledge-augmented AI
- Time Series Forecasting - Stock predictions
## Resources

### Learning Platforms
- Coursera: Andrew Ng's ML course
- Fast.ai: Practical deep learning
- DataCamp: Interactive data science
- Kaggle: Competitions and datasets
### Documentation

- Roadmap.sh Reference: https://roadmap.sh/ai-engineer