| name | families-linear-algebra |
| description | Framework for Advanced (Multi)Linear Infrastructure in Engineering and Science. Dense linear algebra and tensor operations framework extending BLAS/LAPACK. Use when: linear algebra library design, BLAS implementation, LAPACK algorithms, tensor operations, high-performance computing, dense matrix operations, scientific computing infrastructure, GPU-accelerated linear algebra, or multi-node/multi-GPU dense algebra. |
FAMLIES: Advanced Linear Algebra Infrastructure
Overview
This skill provides a framework for designing and implementing advanced dense linear and tensor algebra libraries. FAMLIES (Framework for Advanced (Multi)Linear Infrastructure in Engineering and Science) addresses limitations of traditional BLAS/LAPACK approaches while enabling better adaptation to new hardware architectures and applications.
Key Innovation: More flexible approach to defining and implementing high-performance dense linear and tensor algorithms that adapts to changing applications, software, and hardware.
Current Landscape & Challenges
BLAS/LAPACK Legacy
Success Factors:
- De facto standard interfaces
- Stringent boundary enforcement between layers
- Wide adoption in scientific computing
- Foundation for ML/data science
Current Limitations:
- Boundary Enforcement: Impedes reducing data movement overhead
- Layer Separation: Prevents loop fusion optimization
- Hardware Adaptation: Slow to adapt to new architectures (GPUs)
- Algorithm Implementation: Delayed adoption of new algorithms
- Flexibility: Rigid structure limits innovation
Modern Challenges
Data Movement Overhead:
- Data movement dominates computation cost
- Boundaries prevent data reuse optimization
- Loop fusion opportunities missed
- Memory hierarchy not fully exploited
New Hardware Architectures:
- GPU acceleration requirements
- Multi-node/multi-GPU systems
- Heterogeneous computing platforms
- Different memory hierarchies
Emerging Applications:
- Machine learning tensor operations
- Data science sparse/dense hybrids
- New matrix factorization algorithms
- Quantum computing interfaces
FAMLIES Framework
Core Principles
- Flexibility: Adaptable to applications, software, and hardware
- Optimization: Enable data movement reduction
- Standards: Maintain interface standards where beneficial
- Performance: High-performance implementation across architectures
- Extensibility: Support new algorithms and operations
Framework Architecture
Layer Flexibility:
- Allow cross-layer optimization
- Enable loop fusion
- Support operation composition
- Flexible boundary enforcement
Hardware Adaptation:
- GPU-specific optimizations
- Multi-node coordination
- Memory hierarchy awareness
- Architecture-specific implementations
Algorithm Support:
- Dense linear algebra (BLAS/LAPACK level)
- Tensor operations (multi-linear algebra)
- New factorization algorithms
- Application-specific operations
Leveraged Projects
BLIS (BLAS-like Library Instantiation Software)
Key Features:
- Flexible BLAS implementation
- Architecture-specific optimization
- Layer decomposition
- Performance portability
Benefits:
- Rapid architecture adaptation
- Optimized implementations
- Maintainable code structure
- Community-driven development
libflame
Key Features:
- LAPACK-level functionality
- Algorithm-by-blocks approach
- Flexible implementation
- Research-oriented design
Benefits:
- Advanced algorithm implementations
- Block-based optimization
- Research platform
- Extensible framework
Dense Linear Algebra Operations
BLAS-Level Operations
Level 1: Vector operations
- DOT: dot product
- AXPY: scaled vector addition
- NORM: vector norm
- SCALE: vector scaling
Level 2: Matrix-vector operations
- GEMV: general matrix-vector multiply
- GER: general rank-1 update
- SYMV: symmetric matrix-vector multiply
- TRSV: triangular solve
Level 3: Matrix-matrix operations
- GEMM: general matrix multiply
- SYMM: symmetric matrix multiply
- TRMM: triangular matrix multiply
- SYRK: symmetric rank-k update
LAPACK-Level Operations
Factorizations:
- LU: LU decomposition
- Cholesky: Cholesky factorization
- QR: QR decomposition
- SVD: singular value decomposition
Solvers:
- Linear system solve
- Eigenvalue problems
- Least squares
- Matrix inversion
Auxiliary:
- Condition estimation
- Matrix norms
- Equilibration
- Scaling
Tensor Operations
Multi-linear Operations:
- Tensor contraction
- Tensor decomposition (Tucker, CP)
- Tensor unfolding
- Tensor permutation
Tensor Algebra:
- Einstein notation operations
- Tensor-matrix products
- Tensor-tensor products
- Tensor factorizations
Implementation Considerations
Data Movement Optimization
Loop Fusion:
GEMM(A, B, C);
SCALE(alpha, C);
GEMM_SCALE_FUSED(A, B, C, alpha);
Operation Composition:
COMPOSE({
GEMM(A, B, T1),
GEMM(T1, C, T2),
ADD(T2, D, E)
});
Memory Hierarchy:
- Cache-aware implementations
- Register blocking
- Memory bandwidth optimization
- Data locality maximization
Hardware-Specific Optimization
GPU Implementations:
- CUDA kernel design
- Shared memory utilization
- Warp-level operations
- Multi-stream execution
Multi-Node/Multi-GPU:
- Distributed matrix storage
- Communication optimization
- Load balancing
- Scalability considerations
Heterogeneous Computing:
- CPU-GPU coordination
- Device selection strategies
- Memory transfer optimization
- Hybrid implementations
Algorithm-by-Blocks Approach
Concept: Decompose algorithms into block operations
Benefits:
- Better cache utilization
- Parallelizable blocks
- Flexible implementation
- Data locality
Example: LU Factorization by Blocks:
def LU_by_blocks(A, block_size):
n = A.shape[0]
for k in range(0, n, block_size):
LU_kernel(A[k:k+bs, k:k+bs])
for i in range(k+bs, n, block_size):
solve_panel(A[k:k+bs, k:k+bs], A[i:i+bs, k:k+bs])
for j in range(k+bs, n, block_size):
GEMM(A[i:i+bs, k:k+bs],
A[k:k+bs, j:j+bs],
A[i:i+bs, j:j+bs])
Performance Optimization Techniques
Cache Blocking
Register Blocking: Keep data in registers
L1 Cache Blocking: Reuse data in L1 cache
L2 Cache Blocking: Maximize L2 utilization
TLB Blocking: Avoid TLB misses
Parallelization
Thread-Level: Multi-threaded execution
SIMD-Level: Vector instructions
GPU-Level: GPU parallelism
Distributed-Level: Multi-node parallelism
Memory Optimization
Preconditioning: Reduce memory traffic
In-place Operations: Avoid extra memory
Strided Access: Optimal memory patterns
Alignment: Cache line alignment
Framework Components
Operation Definitions
Standard Operations: BLAS/LAPACK compatible
Extended Operations: New operation types
Tensor Operations: Multi-linear algebra
Composite Operations: Operation sequences
Implementation Layer
CPU Kernels: Optimized CPU implementations
GPU Kernels: GPU-specific kernels
Hybrid Kernels: Mixed CPU-GPU
Reference Kernels: Portable reference code
Interface Layer
Standard APIs: BLAS/LAPACK compatibility
Extended APIs: New operation interfaces
Language Bindings: C, Fortran, Python, etc.
Framework APIs: High-level framework calls
Research Paper Reference
Paper: "A Proposed Framework for Advanced (Multi)Linear Infrastructure in Engineering and Science (FAMLIES)"
- Authors: Devin A. Matthews, Tze Meng Low, Margaret E. Myers, et al.
- arXiv ID: 2604.07311
- Published: April 8, 2026
- Categories: cs.MS (Mathematical Software)
- Link: https://arxiv.org/abs/2604.07311
Use Cases
Scientific Computing
Applications:
- Numerical simulations
- Finite element methods
- Computational physics
- Climate modeling
Requirements:
- High performance
- Large-scale matrices
- Distributed computing
- Specialized operations
Machine Learning
Applications:
- Neural network training
- Tensor operations
- Large-scale optimization
- Model inference
Requirements:
- GPU acceleration
- Batch processing
- Tensor operations
- High throughput
Data Science
Applications:
- Large-scale analytics
- Statistical computations
- Data preprocessing
- Dimensionality reduction
Requirements:
- Flexible operations
- Mixed sparse/dense
- Memory efficiency
- Ease of use
Related Skills
- blas-lapack: Traditional BLAS/LAPACK usage
- gpu-computing: GPU acceleration techniques
- tensor-operations: Tensor algebra fundamentals
- high-performance-computing: HPC best practices
Implementation Examples
Basic GEMM with Loop Fusion
void gemm_scale_add(
int m, int n, int k,
double alpha, double beta, double gamma,
double *A, double *B, double *C, double *D,
double *E) {
for (int i = 0; i < m; i++) {
for (int j = 0; j < n; j++) {
double sum = 0.0;
for (int p = 0; p < k; p++) {
sum += A[i*k+p] * B[p*n+j];
}
E[i*n+j] = alpha*sum + beta*C[i*n+j] + gamma*D[i*n+j];
}
}
}
Tensor Contraction
import numpy as np
def tensor_contract_3d(A, B, contract_dims):
"""
Contract tensors A and B along specified dimensions
Args:
A: tensor of shape (a1, a2, a3)
B: tensor of shape (b1, b2, b3)
contract_dims: dimensions to contract
Returns:
Contracted tensor
"""
if contract_dims == ((-1,), (0,)):
return np.einsum('ijk,klm->ijlm', A, B)
einsum_str = build_einsum_string(A.shape, B.shape, contract_dims)
return np.einsum(einsum_str, A, B)
Future Directions
Quantum Computing Integration:
- Quantum linear algebra
- Hybrid quantum-classical algorithms
- Quantum matrix operations
Sparse-Dense Hybrids:
- Mixed sparse/dense operations
- Selective storage strategies
- Format-aware kernels
Automatic Optimization:
- Auto-tuning frameworks
- Machine learning for optimization
- Compiler-driven optimization
See Also
- BLIS project documentation
- libflame documentation
- BLAS/LAPACK standards
- High-performance computing guides