| name | assembly-guide |
| --- | --- |
| description | Assembly language guardrails, patterns, and best practices for AI-assisted development. Use when working with assembly files (.asm, .s, .S), or when the user mentions Assembly/x86/ARM. Provides calling convention guidelines, register usage patterns, and debugging techniques specific to this project's coding standards. |
| license | MIT |
| metadata | {"author":"samuel","version":"1.0","category":"language","language":"assembly","extensions":".asm,.s,.S"} |
Assembly Guide
Applies to: x86-64 (System V ABI), ARM64 (AAPCS), NASM, GAS syntax
Core Principles
- Clarity Over Cleverness: Comment every instruction's purpose; assembly lacks self-documentation
- ABI Compliance: Follow calling conventions precisely for interoperability with C/system code
- Minimal Register Pressure: Preserve callee-saved registers, minimize spills to stack
- Correctness First: Get it working correctly, then profile, then optimize with SIMD
- Structured Layout: Use consistent label naming, section organization, and macro definitions
Guardrails
Architecture Selection
- Declare target architecture at the top of every file
- x86-64: default for Linux server and desktop workloads and for Intel-based Macs
- ARM64: default for Apple Silicon, mobile, and embedded Linux
- Never mix architecture-specific code without `%ifdef` / `.ifdef` guards
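GCC and Clang run the C preprocessor over `.S` files (capital S), so the same predefined macros used in C can guard assembly there. Below is a minimal C sketch of that guard pattern. It assumes GCC or Clang, which predefine `__x86_64__` and `__aarch64__` (MSVC uses different macros). The function name `target_arch` is hypothetical.

```c
#include <stddef.h>

/* Reports which architecture this translation unit was compiled for.
   The #if / #elif chain is the same shape used to guard code in .S files. */
const char *target_arch(void) {
#if defined(__x86_64__)
    return "x86-64";
#elif defined(__aarch64__)
    return "arm64";
#else
    return "unknown";   /* unsupported target: a real file might use #error */
#endif
}
```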
Calling Conventions
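- x86-64 System V ABI (Linux, macOS, BSD):
  - Integer/pointer arguments: `rdi`, `rsi`, `rdx`, `rcx`, `r8`, `r9` (in order)
  - Floating-point arguments: `xmm0`-`xmm7`
  - Return value: `rax` (integer), `xmm0` (float)
  - Caller-saved (volatile): `rax`, `rcx`, `rdx`, `rsi`, `rdi`, `r8`-`r11`, and all `xmm` registers
  - Callee-saved (non-volatile): `rbx`, `rbp`, `r12`-`r15`
  - The stack must be 16-byte aligned immediately before a `call` instruction. Because `call` pushes an 8-byte return address, `rsp % 16 == 8` at function entry
- ARM64 AAPCS (Linux, macOS):
  - Arguments: `x0`-`x7` (integer/pointer), `v0`-`v7` (floating-point, accessed as `d0`-`d7` for `double` and `s0`-`s7` for `float`)
  - Return value: `x0` (integer), `d0` / `s0` (floating-point)
  - Callee-saved: `x19`-`x28`, `x29` (frame pointer), and the low 64 bits of `v8`-`v15` (`d8`-`d15`)
  - `x30` (link register) is overwritten by every `bl`, so any non-leaf function must save it
  - The stack pointer must be 16-byte aligned at all times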
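The integer argument order above can be encoded as a lookup table. This is handy for checking hand-written function headers or for generating them. The sketch below covers only the integer/pointer registers listed above. The names `call_abi`, `int_arg_register`, and `int_arg_index` are hypothetical. Arguments past the last register go on the stack, which is reported here as `NULL` / `-1`.

```c
#include <stddef.h>
#include <string.h>

enum call_abi { ABI_SYSV_X86_64, ABI_AAPCS64 };

/* Integer/pointer argument registers, in order, per the lists above. */
static const char *const SYSV_INT_ARGS[] = {"rdi", "rsi", "rdx", "rcx", "r8", "r9"};
static const char *const AAPCS64_INT_ARGS[] = {"x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7"};

/* Register carrying 0-based integer argument `index`, or NULL if it is passed on the stack. */
const char *int_arg_register(enum call_abi abi, size_t index) {
    if (abi == ABI_SYSV_X86_64) {
        size_t n = sizeof SYSV_INT_ARGS / sizeof SYSV_INT_ARGS[0];
        return index < n ? SYSV_INT_ARGS[index] : NULL;
    }
    size_t n = sizeof AAPCS64_INT_ARGS / sizeof AAPCS64_INT_ARGS[0];
    return index < n ? AAPCS64_INT_ARGS[index] : NULL;
}

/* Argument position held by `reg`, or -1 if `reg` is not an integer argument register. */
int int_arg_index(enum call_abi abi, const char *reg) {
    for (size_t i = 0; int_arg_register(abi, i) != NULL; i++) {
        if (strcmp(int_arg_register(abi, i), reg) == 0) {
            return (int)i;
        }
    }
    return -1;
}
```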
Register Usage
- Document which registers hold which logical values at function entry
- Never clobber callee-saved registers without saving and restoring them
- Use `rbp` / `x29` as the frame pointer for debuggability (omit it only in leaf functions)
- Reserve scratch registers for temporaries; name them in comments
- Zero-extend results when returning values smaller than 64 bits
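Zero-extension on x86-64 is easy to get wrong, because only 32-bit register writes extend automatically. Writing `eax` clears bits 32-63 of `rax`. Writing `al` or `ax` leaves the upper bits holding stale data, which is why `movzx` exists. The following is a small C model of those semantics, not real register access. The helper names are hypothetical.

```c
#include <stdint.h>

/* Models `mov al, imm8`: only bits 0-7 change, bits 8-63 keep their old value. */
uint64_t write_al(uint64_t rax, uint8_t v) {
    return (rax & ~(uint64_t)0xFF) | v;
}

/* Models `movzx eax, al`: the byte is zero-extended through all 64 bits. */
uint64_t movzx_8(uint8_t v) {
    return (uint64_t)v;
}

/* Models writing a 32-bit register (`mov eax, r32`, or `w0` on ARM64): upper half cleared. */
uint64_t write_eax(uint64_t rax, uint32_t v) {
    (void)rax;              /* the old upper bits are discarded entirely */
    return (uint64_t)v;
}
```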
Stack Management
- Keep the stack 16-byte aligned: at every `call` site on x86-64, and at all times on ARM64
- Allocate local variables by subtracting from `rsp` / `sp` in the prologue
- Deallocate them in the epilogue before `ret` (never leave the stack dirty)
- Use the red zone (the 128 bytes below `rsp`) only in leaf functions under the System V ABI
- Never write below the stack pointer outside the red zone
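The alignment arithmetic behind the prologue can be sketched in C. Assumptions: `pushes` counts every 8-byte `push` after entry (including `push rbp`), and `locals` is the unpadded size of local storage. The function names are hypothetical. On x86-64 the computation starts from the 8-byte offset left by `call`. On ARM64 it is a plain round-up to 16.

```c
#include <stddef.h>

/* x86-64 System V: the N in `sub rsp, N` that restores 16-byte alignment for calls. */
size_t x86_64_frame_sub(size_t pushes, size_t locals) {
    size_t used = 8 + 8 * pushes + locals;          /* 8 = return address pushed by call */
    size_t aligned = (used + 15) & ~(size_t)15;     /* round up to a multiple of 16 */
    return aligned - 8 - 8 * pushes;                /* the part `sub rsp` must cover */
}

/* ARM64: sp is always 16-byte aligned, so frame sizes round up to 16. */
size_t arm64_frame_size(size_t bytes) {
    return (bytes + 15) & ~(size_t)15;
}
```

For example, a function that pushes `rbp` and `rbx` and has no locals still needs `sub rsp, 8` before it can call anything.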
Documentation
- File header: purpose, target architecture, assembler syntax, author
- Function header: C-style prototype comment, argument register mapping, return value
- Inline comments: explain the why, not the what (avoid comments like `; increment counter`)
- Label naming: `module_function_sublabel` (e.g., `crypto_sha256_loop`)
- Constants: use `equ` / `.equ` directives with descriptive names
Key Patterns
x86-64 Function with Frame Pointer
```nasm
; long compute(long x, long y, long z)
; Args: rdi = x, rsi = y, rdx = z
; Returns: rax = x * y + z
global compute
compute:
    push    rbp             ; save frame pointer
    mov     rbp, rsp        ; establish stack frame
    mov     rax, rdi        ; rax = x
    imul    rax, rsi        ; rax = x * y
    add     rax, rdx        ; rax = x * y + z
    pop     rbp             ; restore frame pointer
    ret
```
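In line with Correctness First, a C reference implementation is a convenient oracle for testing `compute`. The sketch below uses the hypothetical name `compute_ref`. `imul` and `add` wrap modulo 2^64, but signed overflow is undefined in C, so the reference does its arithmetic in `uint64_t` and converts back. That final conversion is implementation-defined, and it wraps on GCC and Clang.

```c
#include <stdint.h>

/* Reference for compute(x, y, z) = x * y + z with imul/add wraparound semantics. */
int64_t compute_ref(int64_t x, int64_t y, int64_t z) {
    uint64_t r = (uint64_t)x * (uint64_t)y + (uint64_t)z;   /* defined modular arithmetic */
    return (int64_t)r;
}
```

To use it, declare `extern long compute(long x, long y, long z);` in a C driver, link the driver with the assembled object, and compare both functions on random inputs. `long` is 64 bits on Linux and macOS.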
ARM64 AAPCS Function
```asm
// int64_t multiply_add(int64_t a, int64_t b, int64_t c)
// Args: x0 = a, x1 = b, x2 = c | Returns: x0 = a * b + c
.global multiply_add
multiply_add:
    stp     x29, x30, [sp, #-16]!   // save fp and lr
    mov     x29, sp                 // establish stack frame
    mul     x0, x0, x1              // x0 = a * b
    add     x0, x0, x2              // x0 = a * b + c
    ldp     x29, x30, [sp], #16     // restore fp and lr
    ret
```
SIMD / SSE2 (4 floats per iteration)
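```nasm
; void add_f32(float *dst, const float *a, const float *b, size_t n)
; Args: rdi = dst, rsi = a, rdx = b, rcx = n
; No alignment requirement: vector memory accesses all use movups, because
; legacy-SSE addps with a memory operand faults unless it is 16-byte aligned
global add_f32
add_f32:
    mov     r8, rcx             ; r8 = n, kept for the scalar tail
    shr     rcx, 2              ; rcx = n / 4 vector iterations
    jz      .tail               ; fewer than 4 elements: scalar path only
.loop:
    movups  xmm0, [rsi]         ; load 4 floats from a
    movups  xmm1, [rdx]         ; load 4 floats from b
    addps   xmm0, xmm1          ; xmm0 = a[i..i+3] + b[i..i+3]
    movups  [rdi], xmm0         ; store 4 results
    add     rsi, 16
    add     rdx, 16
    add     rdi, 16
    dec     rcx
    jnz     .loop
.tail:
    and     r8, 3               ; r8 = n % 4 leftover elements
    jz      .done
.tail_loop:
    movss   xmm0, [rsi]         ; load a[i]
    addss   xmm0, [rdx]         ; add b[i] (scalar operands need no alignment)
    movss   [rdi], xmm0         ; store dst[i]
    add     rsi, 4
    add     rdx, 4
    add     rdi, 4
    dec     r8
    jnz     .tail_loop
.done:
    ret
```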
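A C reference is useful as a test oracle for `add_f32`, especially with lengths that are not multiples of 4, where a missing tail loop silently leaves elements unprocessed. The sketch below uses a hypothetical function `add_f32_ref`. On x86 it mirrors the same structure with SSE intrinsics from `<xmmintrin.h>`: unaligned loads, a 4-wide add, and a scalar loop for the `n % 4` leftovers. On other targets it falls back to plain C.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* dst[i] = a[i] + b[i] for i in [0, n); no alignment requirement on any pointer. */
void add_f32_ref(float *dst, const float *a, const float *b, size_t n) {
    size_t i = 0;
#if defined(__SSE__)
    for (; i + 4 <= n; i += 4) {                 /* 4 floats per iteration */
        __m128 va = _mm_loadu_ps(a + i);         /* movups */
        __m128 vb = _mm_loadu_ps(b + i);         /* movups */
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
#endif
    for (; i < n; i++) {                         /* scalar tail: n % 4 elements */
        dst[i] = a[i] + b[i];
    }
}
```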
Linux x86-64 Syscall Interface
```nasm
; Syscall: rax = number, args in rdi/rsi/rdx/r10/r8/r9, return in rax
; Note: r10 replaces rcx because the syscall instruction overwrites
; rcx (with the return address) and r11 (with RFLAGS)
SYS_WRITE   equ 1
SYS_EXIT    equ 60

section .data
msg     db "Hello, world!", 10
msg_len equ $ - msg

section .text
global _start
_start:
    mov     rax, SYS_WRITE      ; write(stdout, msg, msg_len)
    mov     rdi, 1              ; fd = STDOUT
    lea     rsi, [rel msg]      ; RIP-relative for PIC
    mov     rdx, msg_len
    syscall
    mov     rax, SYS_EXIT       ; exit(0)
    xor     edi, edi
    syscall
```
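The same register convention can be exercised from C with GCC/Clang extended inline assembly, which makes the clobbers explicit. The sketch below is hypothetical (`raw_write` is not a libc function). It issues `write` (syscall number 1) directly on Linux x86-64, declares `rcx` and `r11` as clobbered, and falls back to the POSIX `write` wrapper on other targets. On failure the raw path returns `-errno` rather than setting `errno`.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <unistd.h>

/* write(2) via the Linux x86-64 syscall ABI: rax = 1, rdi = fd, rsi = buf, rdx = len. */
long raw_write(int fd, const void *buf, size_t len) {
#if defined(__x86_64__) && defined(__linux__)
    long ret;
    __asm__ volatile ("syscall"
                      : "=a"(ret)                                   /* rax: result or -errno */
                      : "a"(1L), "D"((long)fd), "S"(buf), "d"(len)
                      : "rcx", "r11", "memory");                    /* clobbered by syscall */
    return ret;
#else
    return (long)write(fd, buf, len);
#endif
}
```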
Position-Independent Code (PIC)
```nasm
default rel                     ; all memory refs become RIP-relative

section .data
align 8, db 0                   ; 8-byte alignment makes the plain mov in get_counter atomic
counter dq 0

section .text
global get_counter
get_counter:
    mov     rax, [counter]      ; RIP-relative with default rel
    ret

global increment_counter
increment_counter:
    mov     eax, 1
    lock xadd [counter], rax    ; atomically: rax = old value, counter += 1
    inc     rax                 ; return the new value without re-reading counter
    ret
```
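For comparison, here is a C11 counterpart using `<stdatomic.h>`. The C function names are hypothetical mirrors of the assembly symbols. `atomic_fetch_add` returns the value *before* the addition (on x86-64 it typically compiles to `lock xadd`). The new value is therefore computed from that result instead of re-reading the counter, which another thread could change in between.

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic int64_t counter = 0;

int64_t get_counter(void) {
    return atomic_load(&counter);               /* atomic read */
}

int64_t increment_counter(void) {
    return atomic_fetch_add(&counter, 1) + 1;   /* old value + 1 = value this call produced */
}
```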
Debugging
GDB Commands
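```text
gdb ./program
# Intel syntax to match NASM
(gdb) set disassembly-flavor intel
# TUI views: disassembly and live registers
(gdb) layout asm
(gdb) layout regs
# Break at an address, start the program, then show the next instruction at every stop
(gdb) break *0x401000
(gdb) run
(gdb) display/i $pc
# Single-step: stepi enters calls, nexti steps over them
(gdb) stepi
(gdb) nexti
# Inspect registers and memory (4 eight-byte words in hex at rsp)
(gdb) info registers
(gdb) p/x $rax
(gdb) x/4gx $rsp
```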
objdump & strace
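```sh
objdump -d -M intel program             # disassemble in Intel syntax
objdump -h program                      # section headers
objdump -t program                      # symbol table
objdump -r program.o                    # relocations in an object file
strace ./program                        # trace every system call
strace -e trace=write,read ./program    # trace only write and read
```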
Tooling
Assemblers & Linkers
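```sh
nasm -f elf64 -g -F dwarf program.asm -o program.o   # NASM, Linux ELF64 with DWARF debug info
nasm -f macho64 program.asm -o program.o             # NASM, macOS Mach-O 64-bit
as --64 -g program.s -o program.o                    # GNU as (GAS syntax)
clang -c program.s -o program.o                      # Clang integrated assembler
ld -o program program.o                              # standalone _start program, no libc
gcc -o program program.o                             # link with libc (program defines main); add -no-pie if the code is not PIC
gcc -shared -o libfoo.so foo.o                       # shared library (code must be PIC)
```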
Verification
```sh
nm program.o            # list symbols
nm -u program.o         # list only undefined (external) symbols
readelf -S program.o    # section headers
```