一键在 Manus 中运行任何 Skill

tla-spec

星标1

分支0

更新时间2026年4月2日 04:22

Use when writing or reviewing TLA+ specifications for coordination protocols, when verifying safety/liveness properties of distributed algorithms, or when TLC model checking fails and you need diagnostic guidance. Evidence-backed TLA+ correctness methodology.

安装

用 Codex 或 Claude 帮你安装复制这段 Prompt，粘贴到 Codex、Claude 或其他助手里，让它检查 Skill 页面并帮你完成安装。

在 Manus 中运行

来源

ahrav

ahrav/Gossip-rs

打开 GitHub 仓库查看创作者相关仓库

下载

在 Manus 中运行

TLA+ Specification Correctness Guide

A phased methodology for writing correct TLA+ specifications. Every rule, pattern, and recommendation traces to established literature, Lamport's own specifications, or battle-tested industrial practice. No guessing.

This skill guides you through specification from foundations to model checking configuration. Follow the phases in order for new specs; jump to specific phases for targeted reviews.

Evidence Base

ID	Source	Key Principle
M1	Lamport, Specifying Systems (2002)	Canonical form: `Init /\ [][Next]_vars /\ Fairness`
M2	Lamport, Specifying Systems Ch. 6	Stuttering: `[][Next]_vars` permits stuttering steps; compositional refinement
M3	Lamport, Specifying Systems Ch. 8	Liveness: temporal properties, fairness as specification component
M4	Lamport, The Temporal Logic of Actions (1994)	Weak vs strong fairness definitions and selection criteria
M5	Lamport, Specifying Systems Ch. 5	UNCHANGED: every action must specify all unmodified variables
M8	Lamport, Specifying Systems Ch. 2-4	Abstraction: model WHAT not HOW, separate interface from implementation
M9	Lamport, TLA+ community practice	TypeOK as first invariant constraining variable domains
M10	Newcombe et al., CACM 2015	AWS experience: 3-5 processes sufficient for most design bugs
M11	TLC reference manual	TLC config options and their soundness guarantees
M12	Yu et al., Model Checking TLA+ Specifications (1999)	TLC implementation, symmetry reduction, state constraints
M14	Lamport, Specifying Systems; Raft TLA+ spec	Distributed systems patterns in TLA+
M16	Lamport, A PlusCal User's Manual (2009)	PlusCal as algorithmic specification language compiling to TLA+

Confidence Tiers

Every rule is tagged with a confidence tier. Higher tiers have stronger evidence and should be followed more strictly.

Tier	Meaning	Compliance
T1	High confidence, strong evidence from multiple sources	Non-negotiable for correctness
T2	Good evidence, note caveats	Strong default, deviate with documented reason
T3	Use with caution, limited evidence	Useful guidance, verify in your context

When to Use

Writing a new TLA+ specification for any protocol or algorithm
Reviewing an existing TLA+ specification for correctness
Adding liveness properties to a safety-only specification
Configuring TLC for model checking
Modeling distributed systems protocols (consensus, replication, leader election)
Deciding between PlusCal and TLA+ for a new specification

When NOT to Use

Writing implementation code (this is about specifications, not programs)
Working with other formal methods (Alloy, SPIN, Coq) — different tools, different rules
Simple sequential algorithms where testing suffices
When you need a TLA+ syntax tutorial (this assumes basic TLA+ literacy)

Phase 1: Specification Foundations

Establish the structural skeleton of a correct specification. Every TLA+ specification must satisfy these non-negotiable fundamentals before any properties are checked.

1.1 Canonical Form [T1, M1, P4.1.F1]

Use the canonical form as the default template for all specifications.

Safety-only specification (no liveness requirements):

---- MODULE MyProtocol ----
EXTENDS Integers, Sequences, FiniteSets, TLC

CONSTANTS
    Node,       \* Set of participating nodes
    MaxValue    \* Domain bound for model checking

VARIABLES
    state,      \* Per-node local state
    msgs        \* Set of in-flight messages (or Bag for realism)

vars == <<state, msgs>>

------------------------------------------------------------
\* Type invariant — ALWAYS define first [T1, M9]
TypeOK ==
    /\ state \in [Node -> {"idle", "active", "done"}]
    /\ msgs \subseteq MessageType

------------------------------------------------------------
\* Initial state
Init ==
    /\ state = [n \in Node |-> "idle"]
    /\ msgs = {}

------------------------------------------------------------
\* Actions — each MUST specify UNCHANGED for all unmodified variables [T1, M5]
ActionA(n) ==
    /\ state[n] = "idle"
    /\ state' = [state EXCEPT ![n] = "active"]
    /\ UNCHANGED <<msgs>>

ActionB(n) ==
    /\ state[n] = "active"
    /\ msgs' = msgs \cup {[type |-> "done", from |-> n]}
    /\ UNCHANGED <<state>>

------------------------------------------------------------
\* Next-state relation
Next ==
    \E n \in Node :
        \/ ActionA(n)
        \/ ActionB(n)

------------------------------------------------------------
\* Specification — canonical form
Spec == Init /\ [][Next]_vars

====

Safety + liveness specification (add fairness only when needed):

\* Add ONLY for liveness properties [T1, M1, M3]
Fairness ==
    /\ \A n \in Node : WF_vars(ActionA(n))
    /\ \A n \in Node : WF_vars(ActionB(n))

Spec == Init /\ [][Next]_vars /\ Fairness

Caveat [P4.1.F1]: Canonical form is a default template, not a law. Lamport himself deviates in some specifications (e.g., omitting fairness entirely for safety-only specs, using different temporal formulas for specific refinement mappings). If you deviate, document WHY with a reference to the specific pattern you are following.

1.2 Stuttering Invariance Checklist [T1, M2, M5]

Stuttering steps (where no variable changes) are fundamental to TLA+ and enable compositional refinement. The [][Next]_vars form explicitly permits them — this is not optional.

STUTTERING CHECKLIST
────────────────────
[ ] Specification uses [][Next]_vars (NOT [][Next])
    → [][Next] without _vars FORBIDS stuttering and breaks refinement.
    → The ONLY valid form is [][Next]_vars where vars is the
      tuple of ALL specification variables.

[ ] vars tuple contains ALL declared VARIABLES
    → If you add a variable, update the vars tuple immediately.
    → Missing a variable from vars silently permits arbitrary changes to it.

[ ] Every action specifies UNCHANGED for ALL unmodified variables
    → An action that mentions v1' but not v2' leaves v2 unconstrained.
    → TLC will explore states where v2 takes ANY value in its type domain.
    → This is the #1 source of spurious counterexamples.

Variable grouping pattern for specs with many variables:

\* Group related variables for manageable UNCHANGED clauses
nodeVars    == <<state, epoch>>
networkVars == <<msgs, delivered>>
configVars  == <<members, leader>>

vars == <<nodeVars, networkVars, configVars>>

\* Actions use group UNCHANGED
SendMessage(n) ==
    /\ ...
    /\ msgs' = msgs \cup {msg}
    /\ UNCHANGED <<nodeVars, configVars, delivered>>

1.3 TypeOK as First Invariant [T1, M9]

TypeOK constrains the domains of all variables. It is always the first invariant you write and the first invariant you check.

Why TypeOK matters:

Catches variable domain errors before expensive property checking
Acts as a sanity check that Init and Next are well-formed
Provides a foundation for TLAPS proofs (type correctness is a proof obligation)
Rapidly detects missing UNCHANGED clauses (unconstrained variables violate TypeOK)

TypeOK construction rules:

TYPEOK CONSTRUCTION
───────────────────
[ ] Every VARIABLE appears in TypeOK

[ ] TypeOK constrains variables to finite domains for TLC
    → "x \in Nat" is valid TLA+ but NOT model-checkable
    → Use "x \in 0..MaxBound" for model checking

[ ] TypeOK is checked BEFORE other invariants
    → In the TLC config: INVARIANT TypeOK, SafetyProp1, SafetyProp2
    → TypeOK violations are always the first thing to debug

[ ] Function-valued variables use domain constraints
    → state \in [Node -> StatusSet] not just "state is a function"
    → Check both the domain AND the range

TypeOK patterns by variable type:

Variable Type	TypeOK Pattern	Example
Simple value	`v \in DomainSet`	`epoch \in Nat` (or `0..MaxEpoch` for TLC)
Function	`v \in [Domain -> Range]`	`state \in [Node -> {"idle","active"}]`
Set	`v \subseteq BaseSet`	`msgs \subseteq MessageType`
Sequence	`v \in Seq(ElemType)`	`log \in Seq(Entry)` (use `BSeq(Entry, MaxLen)` for TLC)
Record	`v \in [field1: T1, field2: T2]`	`msg \in [type: MsgTypes, from: Node]`
Optional	`v \in T \cup {None}`	`leader \in Node \cup {None}`
Bag/Multiset	Custom predicate	`BagToSet(msgs) \subseteq MessageType`

Phase 2: Abstraction and Modeling

Select the right level of abstraction BEFORE writing specification logic. Wrong abstraction is the most expensive specification error — it wastes model-checking time on irrelevant detail or misses real bugs by over-abstracting.

2.1 Seven-Step Abstraction Framework [T2, P3.1, M8]

Follow these steps in order when starting a new specification.

Step 1: State the property first

Before modeling any behavior, write down the property you want to verify in plain English. This determines what MUST be in the specification and what can be abstracted away.

Example:
  Property: "No two nodes hold the lock simultaneously"
  → Need: lock state per node
  → Don't need: network packet format, serialization, retry timers

Step 2: Identify minimal state

List the minimum set of variables needed to express the property from Step 1. Every variable must appear in at least one property or be necessary for computing a variable that does.

Minimal state test: For each variable, ask:
  "If I remove this variable, can I still express all my properties?"
  If yes → remove it.

Step 3: Model the environment as nondeterminism

Do not model external systems (clients, networks, OS) in detail. Replace them with nondeterministic choices that capture all possible behaviors.

\* BAD: modeling a specific client
ClientSend == /\ clientState = "ready"
              /\ clientState' = "waiting"
              /\ ...

\* GOOD: nondeterministic environment
ReceiveRequest(n) == \* A request can arrive at any node, any time
    /\ state[n] = "idle"
    /\ state' = [state EXCEPT ![n] = "processing"]
    /\ UNCHANGED <<...>>

Step 4: Choose communication model

Model	TLA+ Representation	Use When	Tradeoff
Set of messages	`msgs \subseteq MsgType`	Safety properties only	Simplest; loses ordering and duplicates
Bag (multiset)	`msgs \in BagType`	Need duplicate detection	Realistic; moderate complexity
Sequence per link	`chan \in [Link -> Seq(MsgType)]`	FIFO ordering required	Most complex; needed for TCP-like channels

[T2, P3.4, M14]: For most distributed protocol specifications, start with a set model. Upgrade to bag/sequence only if a property depends on message ordering or duplicate counting.

Step 5: Separate WHAT from HOW

Specify the intended behavior, not the implementation mechanism.

\* BAD: specifying HOW (implementation detail)
AcquireLock(n) ==
    /\ Send([type |-> "lock_req", from |-> n])
    /\ WaitForReply(n)
    /\ ProcessReply(n)

\* GOOD: specifying WHAT (observable behavior)
AcquireLock(n) ==
    /\ lock = None
    /\ lock' = n
    /\ UNCHANGED <<...>>

Step 6: Diagnose over/under-specification

Symptom	Diagnosis	Fix
TLC explores billions of states for a 3-node model	Over-specified (too much detail)	Remove implementation-level state
Safety property passes but system is clearly wrong	Under-specified (missing behaviors)	Add actions for failure cases, concurrent operations
Liveness property fails with trivial counterexample	Missing fairness or missing actions	Check that all intended behaviors have actions
No errors but the spec "feels wrong"	Wrong abstraction level	Return to Step 1, re-examine the property

Step 7: Refine incrementally

Start with the simplest correct spec and add detail only when needed to express properties that the simpler spec cannot. Each refinement step should be verifiable: the refined spec should refine (implement) the abstract spec.

2.2 Ten Distributed Systems Patterns [T2, P3.4, M14]

Common modeling patterns for distributed protocols. Each pattern includes the TLA+ idiom and notes on when to use it.

Pattern 1: Quorum

\* Abstract: CONSTANT set of quorums
CONSTANT Quorum
ASSUME QuorumAssumption ==
    /\ \A Q \in Quorum : Q \subseteq Node
    /\ \A Q1, Q2 \in Quorum : Q1 \cap Q2 /= {}

\* OR computed majority (simpler for symmetric systems)
Quorum == {S \in SUBSET Node : Cardinality(S) * 2 > Cardinality(Node)}

For model checking, the computed form is often better (no need to enumerate quorums in the config). For proof, the abstract CONSTANT form is more general.

Pattern 2: Epoch / Term (Monotonic Counter)

\* Per-node monotonic epoch
VARIABLE currentEpoch  \* currentEpoch \in [Node -> Nat]

\* Epoch advances are strictly increasing per node
AdvanceEpoch(n) ==
    /\ currentEpoch' = [currentEpoch EXCEPT ![n] = currentEpoch[n] + 1]
    /\ ...

\* Stale-epoch rejection
ProcessMessage(n, msg) ==
    /\ msg.epoch >= currentEpoch[n]  \* reject stale
    /\ ...

Pattern 3: Leader Election (Nondeterministic)

\* Abstract leader election — nondeterministic candidacy
BecomeLeader(n) ==
    /\ \A m \in Node : leader[m] /= n  \* not already leader
    /\ leader' = [leader EXCEPT ![n] = n]
    /\ currentEpoch' = [currentEpoch EXCEPT ![n] = currentEpoch[n] + 1]
    /\ UNCHANGED <<...>>

Do NOT model the election protocol (Raft, ZAB) unless you are verifying the election protocol itself. Abstract it as nondeterministic leader selection with appropriate constraints.

Pattern 4: Timeout (Always-Enabled Action, NO Clock)

\* BAD: modeling a clock for timeouts
Timeout(n) ==
    /\ clock[n] >= deadline[n]  \* NO — do not model clocks
    /\ ...

\* GOOD: timeout as an always-enabled action
Timeout(n) ==
    /\ state[n] = "waiting"
    /\ state' = [state EXCEPT ![n] = "timed_out"]
    /\ UNCHANGED <<...>>
\* Timeout can fire at ANY time a node is waiting — this captures
\* all possible timeout behaviors without modeling time.

[T1]: Never model real-time clocks in a TLA+ specification unless you are specifically verifying real-time properties. Timeouts are modeled as always-enabled actions that capture the nondeterminism of when they fire.

Pattern 5: Log Replication (Sequence with Truncate-Overwrite)

\* Log as sequence with append and truncate-overwrite
VARIABLE log  \* log \in [Node -> Seq(Entry)]

AppendEntry(n, entry) ==
    /\ log' = [log EXCEPT ![n] = Append(log[n], entry)]
    /\ UNCHANGED <<...>>

TruncateAndOverwrite(n, idx, entries) ==
    /\ log' = [log EXCEPT ![n] = SubSeq(log[n], 1, idx) \o entries]
    /\ UNCHANGED <<...>>

Pattern 6: Membership Change (Config as Log Entry)

\* Joint consensus: configuration transitions via log
VARIABLE config  \* config \in [Node -> SUBSET Node]

\* Configuration change is a special log entry
ProposeConfigChange(leader, newMembers) ==
    /\ AppendEntry(leader, [type |-> "config", members |-> newMembers])
    /\ config' = [config EXCEPT ![leader] = config[leader] \cup newMembers]
    /\ UNCHANGED <<...>>

Pattern 7: Idempotency (Accumulating Set)

\* Track processed operation IDs to ensure idempotency
VARIABLE processed  \* processed \in [Node -> SUBSET OpId]

ProcessOp(n, op) ==
    /\ op.id \notin processed[n]  \* skip if already processed
    /\ \* ... perform operation ...
    /\ processed' = [processed EXCEPT ![n] = processed[n] \cup {op.id}]
    /\ UNCHANGED <<...>>

Pattern 8: Two-Phase Protocol (Prepare-Collect-Decide)

\* Two-phase pattern: prepare → collect votes → decide
VARIABLE phase    \* [Node -> {"idle", "prepare", "decide"}]
VARIABLE votes    \* [Node -> SUBSET Node]

Prepare(n) ==
    /\ phase[n] = "idle"
    /\ phase' = [phase EXCEPT ![n] = "prepare"]
    /\ votes' = [votes EXCEPT ![n] = {}]
    /\ \* send prepare messages
    /\ UNCHANGED <<...>>

ReceiveVote(n, voter) ==
    /\ phase[n] = "prepare"
    /\ votes' = [votes EXCEPT ![n] = votes[n] \cup {voter}]
    /\ UNCHANGED <<...>>

Decide(n) ==
    /\ phase[n] = "prepare"
    /\ votes[n] \in Quorum  \* collected a quorum
    /\ phase' = [phase EXCEPT ![n] = "decide"]
    /\ UNCHANGED <<...>>

Pair with a refinement mapping to show this implements a simpler atomic specification. [T2]

Pattern 9: Failure and Restart (Durable vs Volatile State)

\* Failure: volatile state lost, durable state preserved
VARIABLE volatileState  \* lost on restart
VARIABLE durableState   \* preserved on restart

Crash(n) ==
    /\ volatileState' = [volatileState EXCEPT ![n] = InitVolatile]
    /\ UNCHANGED <<durableState, ...>>
    \* Durable state is UNCHANGED — this is the key invariant

Restart(n) ==
    /\ \* recover from durableState[n]
    /\ volatileState' = [volatileState EXCEPT
                           ![n] = RecoverFrom(durableState[n])]
    /\ UNCHANGED <<durableState, ...>>

Pattern 10: Ghost Variables (Append-Only History)

\* Ghost variable: records history for property checking
\* Does NOT affect system behavior — only used in invariants/properties
VARIABLE history  \* history \in Seq(Event)

\* Append to history in every action that is relevant
ActionWithHistory(n) ==
    /\ \* ... normal action logic ...
    /\ history' = Append(history, [type |-> "action", node |-> n, ...])
    /\ UNCHANGED <<...>>

\* Use ghost variable in properties
NoDoubleCommit ==
    \A i, j \in 1..Len(history) :
        /\ history[i].type = "commit"
        /\ history[j].type = "commit"
        /\ history[i].key = history[j].key
        => history[i].value = history[j].value

Ghost variables are specification-only artifacts. They are append-only (history never shrinks) and used solely to express properties that reference past states. Mark them clearly in comments.

Phase 3: Safety Verification

After establishing the specification structure (Phase 1) and modeling decisions (Phase 2), verify safety properties.

3.1 Safety Specification Checklist [T1, M9, M5]

SAFETY CHECKLIST
────────────────
[ ] TypeOK is defined and checked first
    → Every variable constrained to its domain [T1, M9]

[ ] All safety invariants expressed as state predicates
    → "P is always true" written as INVARIANT P in TLC config
    → For action properties, use [][ActionProp]_vars form

[ ] Deadlock-freedom checked between safety and liveness phases
    → TLC checks deadlock by default (all states have a Next step)
    → Intentional termination: add a Done action or use
      PROPERTY [][Next]_vars => <>[]done to distinguish
    → [P4.4.F6]: ALWAYS check deadlock-freedom after safety passes
      but BEFORE adding liveness. A deadlocked spec trivially
      satisfies some liveness properties (vacuous truth).

[ ] No CONSTRAINT in the TLC config for the final check
    → CONSTRAINT is acceptable during development to speed iteration
    → CONSTRAINT MUST be removed for the final safety verification
    → [T1, M11]: CONSTRAINT can hide states reachable via longer paths

[ ] Invariants are inductive (for TLAPS proofs)
    → An invariant I is inductive if: I /\ Next => I'
    → TypeOK is typically inductive; complex safety properties
      may need strengthening (adding auxiliary conjuncts)

3.2 Deadlock-Freedom Gate [T1, P4.4.F6]

This check is mandatory between the safety and liveness phases.

A specification deadlocks if some reachable state has no Next step enabled. TLC checks this by default. However:

If your protocol has intentional termination, you must distinguish it from unintentional deadlock:

\* Termination action — prevents false deadlock reports
Done ==
    /\ AllNodesFinished
    /\ UNCHANGED vars

Next == ActionA \/ ActionB \/ ... \/ Done

A deadlocked spec can vacuously satisfy liveness properties. If the system deadlocks in a state where the liveness property is already satisfied, TLC will report no error. This is not a real liveness guarantee.

DEADLOCK GATE
─────────────
[ ] TLC reports zero deadlock states
    OR
[ ] All "deadlock" states are intentional termination states
    AND they are documented in the spec
    AND a Done action is present

3.3 TLC Safety Configuration [T1, M11, M12]

Recommended TLC configuration for safety checking:

\* MyProtocol.cfg — safety checking configuration
SPECIFICATION Spec

\* Type invariant first, then safety properties
INVARIANT TypeOK
INVARIANT SafetyProperty1
INVARIANT SafetyProperty2

\* State constraint for development ONLY — remove for final check
\* CONSTRAINT StateConstraint

\* Symmetry sets — safe for safety properties [T1, M11]
SYMMETRY NodeSymmetry

\* Constants
CONSTANT Node = {n1, n2, n3}
CONSTANT MaxValue = 3

TLC Option	Safety Checking	Notes
`INVARIANT`	Sound	Always use for state predicates
`CONSTRAINT`	Sound but incomplete	May miss states reachable via longer paths; remove for final check
`SYMMETRY`	Sound if symmetry set is correct	Incorrect symmetry sets produce unsound results silently
`VIEW`	Sound	Reduces state space via abstraction
`ACTION_CONSTRAINT`	Sound	Restricts which actions TLC explores

Phase 4: Liveness Verification

Liveness properties ("something good eventually happens") require fairness assumptions and careful TLC configuration. This is where most soundness errors occur.

4.1 Six-Phase Liveness Methodology [T1, M3, P3.2, P3.3]

Follow these phases strictly in order.

Phase L1: Write safety first (complete Phase 3)

Never add liveness to a specification that has not passed safety checking. Liveness debugging on an incorrect safety spec is wasted effort.

Phase L2: Identify liveness requirements

Write down what MUST eventually happen in plain English:

Examples:
  - "Every request eventually receives a response"
  - "Every node eventually learns of every update"
  - "The leader eventually commits every proposed entry"
  - "The system eventually reaches a stable configuration"

Express each as a temporal formula:

\* "Every request eventually receives a response"
Liveness1 == \A r \in requests : <>(responded[r])

\* "The leader eventually commits" (conditional on leadership)
Liveness2 == []<>(committed = proposed)

\* Response: "always eventually" (repeated liveness)
Liveness3 == []<>(\E n \in Node : state[n] = "active")

Phase L3: Select fairness — WF first, SF only when needed [T2, M4, P3.3]

Use the fairness selection decision table:

Condition	Fairness	Rationale
Action is continuously enabled once enabled	`WF_vars(Action)`	WF suffices: if enabled forever, it eventually happens
Action is repeatedly enabled then disabled (intermittent)	`SF_vars(Action)`	SF needed: must happen if repeatedly enabled, even if intermittently disabled
Action represents environment/external input	No fairness	Never make the environment fair — it models adversarial behavior
Action is the composite `Next`	No fairness on `Next` alone	`WF_vars(Next)` says "some action happens" but does not guarantee WHICH action happens; useless for most properties
Timeout action	Usually no fairness	Timeouts model nondeterminism; fairness on timeouts constrains which timeouts fire

WF vs SF decision procedure:

Start with WF_vars(A) for every system action A
Check if TLC finds a liveness counterexample
If the counterexample shows Action A repeatedly enabled then disabled without executing, upgrade to SF_vars(A) for that specific action
Never start with SF — it is a stronger assumption and may hide real liveness bugs

[T2, M4]: Lamport's convention: "Weak fairness is almost always what you want." Strong fairness is needed only for actions that are intermittently disabled by other actions (e.g., a receive action that is enabled only when a message is in transit, and messages can be repeatedly consumed and re-sent).

Phase L4: Verify machine closure (automatic in canonical form)

A specification Spec == Init /\ [][Next]_vars /\ Fairness is machine-closed if the safety part (Init /\ [][Next]_vars) does not constrain which infinite behaviors are possible (only fairness does that).

In the canonical form, machine closure is guaranteed [T1, M1]. You only need to worry about machine closure if you deviate from canonical form (e.g., adding raw temporal formulas to the spec).

Phase L5: Model-check liveness [T1, P3.2, M11, M12]

THIS IS THE CRITICAL STEP. TLC's liveness checking has soundness traps that do not exist in safety checking.

LIVENESS MODEL-CHECKING RULES
──────────────────────────────
[ ] REMOVE all CONSTRAINT directives from the TLC config
    → CONSTRAINT + liveness = UNSOUND [T1, M11]
    → CONSTRAINT can hide infinite behaviors that violate liveness
    → This is NON-NEGOTIABLE for liveness checking

[ ] REMOVE all SYMMETRY directives from the TLC config
    → SYMMETRY + liveness = DANGEROUS [T1, M11]
    → Symmetry reduction may cause TLC to miss liveness violations
    → It is sometimes safe but the conditions are subtle;
      removing it is always safe

[ ] Use -lncheck final for production liveness verification
    → -lncheck final performs the complete liveness check
    → Without -lncheck, TLC may use an incomplete algorithm

[ ] Liveness properties go in PROPERTY (not INVARIANT)
    → INVARIANT is for state predicates (safety)
    → PROPERTY is for temporal formulas (liveness)

Phase L6: Debug liveness failures [T1, P3.2]

TLC liveness counterexamples have a lasso structure: a finite prefix followed by a back-loop to a previously visited state. The back-loop represents an infinite behavior where the liveness property is never satisfied.

Lasso structure:
  s0 → s1 → s2 → ... → sk → ... → sj
                               ↑        |
                               └────────┘
                             (back-loop: infinite repetition)

Debugging procedure:
1. Read the finite prefix: how does the system reach state sk?
2. Read the back-loop: what actions repeat forever?
3. Check: is there an action that SHOULD fire but doesn't?
   → If yes: add or strengthen fairness for that action
4. Check: is the counterexample realistic?
   → Unrealistic: the environment acts adversarially in a way your
     system need not tolerate. Consider if your liveness requirement
     is too strong.
   → Realistic: the protocol has a liveness bug. Fix the protocol.
5. Check: does the back-loop involve only stutter steps?
   → If yes: your specification may be missing an action.

4.2 Complete TLC Soundness Matrix [T1, M11, M12]

This is the complete map of TLC configuration options versus soundness for safety and liveness checking. Reference this table before every TLC run.

Config Option	Safety	Liveness	Risk	Recommendation
INVARIANT	Sound	N/A (wrong directive)	None for safety	Use for state predicates
PROPERTY	N/A (wrong directive)	Sound (with caveats below)	None	Use for temporal formulas
CONSTRAINT	Sound (but incomplete)	UNSOUND	High: hides reachable states from liveness check	Remove for final liveness check; acceptable during development for safety only
SYMMETRY	Sound (if correct)	Dangerous	Medium: may miss counterexamples	Remove for liveness; use only for safety with verified symmetry set
VIEW	Sound	Unverified	Medium: hash collisions possible	Use with caution; verify key results without VIEW
ACTION_CONSTRAINT	Sound	Context-dependent	Medium: may prevent actions needed for liveness counterexample	Review carefully for liveness
-lncheck final	N/A	Required for soundness	High if omitted: may use incomplete algorithm	Always use for production liveness checks
-workers N (N>1)	Sound	Sound (TLC 2.18+)	Low: older TLC versions had bugs with multi-worker liveness	Use current TLC version

[T1, P3.2]: The safe liveness workflow: (1) pass all safety checks with CONSTRAINT/SYMMETRY, (2) remove CONSTRAINT and SYMMETRY, (3) add PROPERTY directives, (4) run with -lncheck final, (5) accept the longer runtime.

Phase 5: Model Checking Configuration

5.1 Small Model Heuristics [T2, M10, P4.4.F3]

AWS's experience (Newcombe et al., CACM 2015): 3-5 processes are sufficient for finding most design bugs. Start small, increase only when needed.

Model Size	When Sufficient	When Insufficient
2 nodes	Mutual exclusion, simple leader-follower	Quorum-based protocols (need 3+)
3 nodes	Most consensus protocols, quorum systems	Protocols with f+1 requirements where f>1
4-5 nodes	Complex fault-tolerance (f=1 with 2f+1)	Threshold-dependent bugs (e.g., f=2 requires 5+ nodes)
>5 nodes	Byzantine fault tolerance (3f+1)	Rarely needed for design verification

Caveat [P4.4.F3]: Small models miss:

Threshold-dependent bugs: Properties that only fail at specific quorum sizes (e.g., a bug that manifests only with exactly 2 of 5 nodes failing)

Symmetry-breaking bugs: Bugs that require specific asymmetric configurations

Scaling-related issues: Performance properties that degrade non-linearly

If your protocol has threshold-dependent behavior, test at least one configuration AT the threshold and one ABOVE it.

5.2 TLC Configuration Guide

Development configuration (fast iteration):

SPECIFICATION Spec
INVARIANT TypeOK
INVARIANT SafetyProperty
CONSTRAINT SmallStateConstraint   \* OK for safety development
SYMMETRY NodeSymmetry             \* OK for safety development
CONSTANT Node = {n1, n2, n3}     \* Start small

Final safety configuration (production):

SPECIFICATION Spec
INVARIANT TypeOK
INVARIANT SafetyProperty
\* NO CONSTRAINT — full state space
SYMMETRY NodeSymmetry             \* OK for safety (if correct)
CONSTANT Node = {n1, n2, n3}

Final liveness configuration (production):

SPECIFICATION Spec
INVARIANT TypeOK                  \* Keep safety checks active
PROPERTY LivenessProperty
\* NO CONSTRAINT — UNSOUND with liveness
\* NO SYMMETRY — DANGEROUS with liveness
CONSTANT Node = {n1, n2, n3}

TLC command-line flags:

Flag	Purpose	When to Use
`-workers auto`	Use all CPU cores	Always (production runs)
`-lncheck final`	Complete liveness check	Always for liveness verification
`-deadlock`	Disable deadlock checking	Only if spec has intentional terminal states AND Done action is present
`-depth N`	Limit search depth	Development only; never for final verification
`-coverage N`	Print coverage stats every N minutes	Development; helps find unreachable actions
`-dump format file`	Dump state graph	Debugging counterexamples
`-simulate num=N`	Random simulation instead of full BFS	Very large state spaces where exhaustive check is infeasible

5.3 Multi-Tool Awareness [T3, P4.3.F6, P4.4.F2]

TLC is not the only TLA+ verification tool. Be aware of alternatives and their strengths.

Tool	Technique	Strengths	Limitations
TLC	Explicit-state model checking	Complete for finite models; well-understood soundness; fast for small models	Exponential state explosion; finite models only
Apalache	Symbolic model checking (SMT)	Handles larger state spaces; bounded model checking for unbounded domains; type checking	Slower per-state; less mature than TLC; limited liveness support
TLAPS	Interactive theorem proving	Full mathematical proofs; unbounded verification; strongest guarantee	High manual effort; steep learning curve
Quint	Emerging: TLA+-like language with better tooling	Modern syntax; integrated testing; simulation	Not yet production-proven; subset of TLA+ expressiveness

TLC-specific advice is marked throughout this skill. Where behavior differs in Apalache, notes are included. However, this skill primarily targets TLC workflows. For Apalache-specific guidance, consult the Apalache documentation.

Apalache type annotations: Apalache requires type annotations that TLC ignores. If you plan to use both tools, add Apalache type annotations as comments:
\* @type: Set(NODE) => Bool;
IsQuorum(S) == Cardinality(S) * 2 > Cardinality(Node)

Phase 6: PlusCal Considerations [T3, M16, P4.4.F1]

PlusCal is a higher-level algorithmic language that compiles to TLA+. It is a valid choice for certain specifications.

6.1 PlusCal Decision Table

Factor	PlusCal	Raw TLA+
Author familiarity	Imperative/sequential programming	Set theory, temporal logic
Specification type	Sequential or simple concurrent algorithms	Complex concurrent/distributed protocols
Expressiveness ceiling	Limited: sequential composition, limited nondeterminism	Full: arbitrary temporal logic, flexible composition
Compilation	Compiles to TLA+ (can inspect/modify output)	Direct
Fairness	Automatic per-process fairness	Manual fairness specification
Best for	Teaching, prototyping, algorithm verification	Protocol verification, refinement, industrial use

[T3, P4.4.F1]: A finished PlusCal specification is better than an unfinished TLA+ specification. If PlusCal gets you to a checkable spec faster, use it. But know the ceiling: PlusCal cannot express all TLA+ patterns (e.g., arbitrary interleaving of non-process actions, complex fairness, stuttering-sensitive refinement).

When to switch from PlusCal to TLA+:

You need fairness on individual actions (not just processes)

You need to compose specifications via refinement

You need ghost variables that participate in the temporal formula

PlusCal's process model does not match your system's concurrency model

Quick Reference

Common mistakes catalog, specification template, decision tables, evidence traceability matrix, and references are in references/quick-reference.md.

name	tla-spec
description	Use when writing or reviewing TLA+ specifications for coordination protocols, when verifying safety/liveness properties of distributed algorithms, or when TLC model checking fails and you need diagnostic guidance. Evidence-backed TLA+ correctness methodology.
user-invocable	true

TLA+ Specification Correctness Guide

This skill guides you through specification from foundations to model checking configuration. Follow the phases in order for new specs; jump to specific phases for targeted reviews.

Evidence Base

ID	Source	Key Principle
M1	Lamport, Specifying Systems (2002)	Canonical form: `Init /\ [][Next]_vars /\ Fairness`
M2	Lamport, Specifying Systems Ch. 6	Stuttering: `[][Next]_vars` permits stuttering steps; compositional refinement
M3	Lamport, Specifying Systems Ch. 8	Liveness: temporal properties, fairness as specification component
M4	Lamport, The Temporal Logic of Actions (1994)	Weak vs strong fairness definitions and selection criteria
M5	Lamport, Specifying Systems Ch. 5	UNCHANGED: every action must specify all unmodified variables
M8	Lamport, Specifying Systems Ch. 2-4	Abstraction: model WHAT not HOW, separate interface from implementation
M9	Lamport, TLA+ community practice	TypeOK as first invariant constraining variable domains
M10	Newcombe et al., CACM 2015	AWS experience: 3-5 processes sufficient for most design bugs
M11	TLC reference manual	TLC config options and their soundness guarantees
M12	Yu et al., Model Checking TLA+ Specifications (1999)	TLC implementation, symmetry reduction, state constraints
M14	Lamport, Specifying Systems; Raft TLA+ spec	Distributed systems patterns in TLA+
M16	Lamport, A PlusCal User's Manual (2009)	PlusCal as algorithmic specification language compiling to TLA+

Confidence Tiers

Every rule is tagged with a confidence tier. Higher tiers have stronger evidence and should be followed more strictly.

Tier	Meaning	Compliance
T1	High confidence, strong evidence from multiple sources	Non-negotiable for correctness
T2	Good evidence, note caveats	Strong default, deviate with documented reason
T3	Use with caution, limited evidence	Useful guidance, verify in your context

When to Use

Writing a new TLA+ specification for any protocol or algorithm
Reviewing an existing TLA+ specification for correctness
Adding liveness properties to a safety-only specification
Configuring TLC for model checking
Modeling distributed systems protocols (consensus, replication, leader election)
Deciding between PlusCal and TLA+ for a new specification

When NOT to Use

Writing implementation code (this is about specifications, not programs)
Working with other formal methods (Alloy, SPIN, Coq) — different tools, different rules
Simple sequential algorithms where testing suffices
When you need a TLA+ syntax tutorial (this assumes basic TLA+ literacy)

Phase 1: Specification Foundations

Establish the structural skeleton of a correct specification. Every TLA+ specification must satisfy these non-negotiable fundamentals before any properties are checked.

1.1 Canonical Form [T1, M1, P4.1.F1]

Use the canonical form as the default template for all specifications.

Safety-only specification (no liveness requirements):

---- MODULE MyProtocol ----
EXTENDS Integers, Sequences, FiniteSets, TLC

CONSTANTS
    Node,       \* Set of participating nodes
    MaxValue    \* Domain bound for model checking

VARIABLES
    state,      \* Per-node local state
    msgs        \* Set of in-flight messages (or Bag for realism)

vars == <<state, msgs>>

------------------------------------------------------------
\* Type invariant — ALWAYS define first [T1, M9]
TypeOK ==
    /\ state \in [Node -> {"idle", "active", "done"}]
    /\ msgs \subseteq MessageType

------------------------------------------------------------
\* Initial state
Init ==
    /\ state = [n \in Node |-> "idle"]
    /\ msgs = {}

------------------------------------------------------------
\* Actions — each MUST specify UNCHANGED for all unmodified variables [T1, M5]
ActionA(n) ==
    /\ state[n] = "idle"
    /\ state' = [state EXCEPT ![n] = "active"]
    /\ UNCHANGED <<msgs>>

ActionB(n) ==
    /\ state[n] = "active"
    /\ msgs' = msgs \cup {[type |-> "done", from |-> n]}
    /\ UNCHANGED <<state>>

------------------------------------------------------------
\* Next-state relation
Next ==
    \E n \in Node :
        \/ ActionA(n)
        \/ ActionB(n)

------------------------------------------------------------
\* Specification — canonical form
Spec == Init /\ [][Next]_vars

====

Safety + liveness specification (add fairness only when needed):

\* Add ONLY for liveness properties [T1, M1, M3]
Fairness ==
    /\ \A n \in Node : WF_vars(ActionA(n))
    /\ \A n \in Node : WF_vars(ActionB(n))

Spec == Init /\ [][Next]_vars /\ Fairness

Caveat [P4.1.F1]: Canonical form is a default template, not a law. Lamport himself deviates in some specifications (e.g., omitting fairness entirely for safety-only specs, using different temporal formulas for specific refinement mappings). If you deviate, document WHY with a reference to the specific pattern you are following.

1.2 Stuttering Invariance Checklist [T1, M2, M5]

Stuttering steps (where no variable changes) are fundamental to TLA+ and enable compositional refinement. The [][Next]_vars form explicitly permits them — this is not optional.

STUTTERING CHECKLIST
────────────────────
[ ] Specification uses [][Next]_vars (NOT [][Next])
    → [][Next] without _vars FORBIDS stuttering and breaks refinement.
    → The ONLY valid form is [][Next]_vars where vars is the
      tuple of ALL specification variables.

[ ] vars tuple contains ALL declared VARIABLES
    → If you add a variable, update the vars tuple immediately.
    → Missing a variable from vars silently permits arbitrary changes to it.

[ ] Every action specifies UNCHANGED for ALL unmodified variables
    → An action that mentions v1' but not v2' leaves v2 unconstrained.
    → TLC will explore states where v2 takes ANY value in its type domain.
    → This is the #1 source of spurious counterexamples.

Variable grouping pattern for specs with many variables:

\* Group related variables for manageable UNCHANGED clauses
nodeVars    == <<state, epoch>>
networkVars == <<msgs, delivered>>
configVars  == <<members, leader>>

vars == <<nodeVars, networkVars, configVars>>

\* Actions use group UNCHANGED
SendMessage(n) ==
    /\ ...
    /\ msgs' = msgs \cup {msg}
    /\ UNCHANGED <<nodeVars, configVars, delivered>>

1.3 TypeOK as First Invariant [T1, M9]

TypeOK constrains the domains of all variables. It is always the first invariant you write and the first invariant you check.

Why TypeOK matters:

Catches variable domain errors before expensive property checking
Acts as a sanity check that Init and Next are well-formed
Provides a foundation for TLAPS proofs (type correctness is a proof obligation)
Rapidly detects missing UNCHANGED clauses (unconstrained variables violate TypeOK)

TypeOK construction rules:

TYPEOK CONSTRUCTION
───────────────────
[ ] Every VARIABLE appears in TypeOK

[ ] TypeOK constrains variables to finite domains for TLC
    → "x \in Nat" is valid TLA+ but NOT model-checkable
    → Use "x \in 0..MaxBound" for model checking

[ ] TypeOK is checked BEFORE other invariants
    → In the TLC config: INVARIANT TypeOK, SafetyProp1, SafetyProp2
    → TypeOK violations are always the first thing to debug

[ ] Function-valued variables use domain constraints
    → state \in [Node -> StatusSet] not just "state is a function"
    → Check both the domain AND the range

TypeOK patterns by variable type:

Variable Type	TypeOK Pattern	Example
Simple value	`v \in DomainSet`	`epoch \in Nat` (or `0..MaxEpoch` for TLC)
Function	`v \in [Domain -> Range]`	`state \in [Node -> {"idle","active"}]`
Set	`v \subseteq BaseSet`	`msgs \subseteq MessageType`
Sequence	`v \in Seq(ElemType)`	`log \in Seq(Entry)` (use `BSeq(Entry, MaxLen)` for TLC)
Record	`v \in [field1: T1, field2: T2]`	`msg \in [type: MsgTypes, from: Node]`
Optional	`v \in T \cup {None}`	`leader \in Node \cup {None}`
Bag/Multiset	Custom predicate	`BagToSet(msgs) \subseteq MessageType`

Phase 2: Abstraction and Modeling

2.1 Seven-Step Abstraction Framework [T2, P3.1, M8]

Follow these steps in order when starting a new specification.

Step 1: State the property first

Before modeling any behavior, write down the property you want to verify in plain English. This determines what MUST be in the specification and what can be abstracted away.

Example:
  Property: "No two nodes hold the lock simultaneously"
  → Need: lock state per node
  → Don't need: network packet format, serialization, retry timers

Step 2: Identify minimal state

List the minimum set of variables needed to express the property from Step 1. Every variable must appear in at least one property or be necessary for computing a variable that does.

Minimal state test: For each variable, ask:
  "If I remove this variable, can I still express all my properties?"
  If yes → remove it.

Step 3: Model the environment as nondeterminism

Do not model external systems (clients, networks, OS) in detail. Replace them with nondeterministic choices that capture all possible behaviors.

\* BAD: modeling a specific client
ClientSend == /\ clientState = "ready"
              /\ clientState' = "waiting"
              /\ ...

\* GOOD: nondeterministic environment
ReceiveRequest(n) == \* A request can arrive at any node, any time
    /\ state[n] = "idle"
    /\ state' = [state EXCEPT ![n] = "processing"]
    /\ UNCHANGED <<...>>

Step 4: Choose communication model

Model	TLA+ Representation	Use When	Tradeoff
Set of messages	`msgs \subseteq MsgType`	Safety properties only	Simplest; loses ordering and duplicates
Bag (multiset)	`msgs \in BagType`	Need duplicate detection	Realistic; moderate complexity
Sequence per link	`chan \in [Link -> Seq(MsgType)]`	FIFO ordering required	Most complex; needed for TCP-like channels

[T2, P3.4, M14]: For most distributed protocol specifications, start with a set model. Upgrade to bag/sequence only if a property depends on message ordering or duplicate counting.

Step 5: Separate WHAT from HOW

Specify the intended behavior, not the implementation mechanism.

\* BAD: specifying HOW (implementation detail)
AcquireLock(n) ==
    /\ Send([type |-> "lock_req", from |-> n])
    /\ WaitForReply(n)
    /\ ProcessReply(n)

\* GOOD: specifying WHAT (observable behavior)
AcquireLock(n) ==
    /\ lock = None
    /\ lock' = n
    /\ UNCHANGED <<...>>

Step 6: Diagnose over/under-specification

Symptom	Diagnosis	Fix
TLC explores billions of states for a 3-node model	Over-specified (too much detail)	Remove implementation-level state
Safety property passes but system is clearly wrong	Under-specified (missing behaviors)	Add actions for failure cases, concurrent operations
Liveness property fails with trivial counterexample	Missing fairness or missing actions	Check that all intended behaviors have actions
No errors but the spec "feels wrong"	Wrong abstraction level	Return to Step 1, re-examine the property

Step 7: Refine incrementally

2.2 Ten Distributed Systems Patterns [T2, P3.4, M14]

Common modeling patterns for distributed protocols. Each pattern includes the TLA+ idiom and notes on when to use it.

Pattern 1: Quorum

\* Abstract: CONSTANT set of quorums
CONSTANT Quorum
ASSUME QuorumAssumption ==
    /\ \A Q \in Quorum : Q \subseteq Node
    /\ \A Q1, Q2 \in Quorum : Q1 \cap Q2 /= {}

\* OR computed majority (simpler for symmetric systems)
Quorum == {S \in SUBSET Node : Cardinality(S) * 2 > Cardinality(Node)}

For model checking, the computed form is often better (no need to enumerate quorums in the config). For proof, the abstract CONSTANT form is more general.

Pattern 2: Epoch / Term (Monotonic Counter)

\* Per-node monotonic epoch
VARIABLE currentEpoch  \* currentEpoch \in [Node -> Nat]

\* Epoch advances are strictly increasing per node
AdvanceEpoch(n) ==
    /\ currentEpoch' = [currentEpoch EXCEPT ![n] = currentEpoch[n] + 1]
    /\ ...

\* Stale-epoch rejection
ProcessMessage(n, msg) ==
    /\ msg.epoch >= currentEpoch[n]  \* reject stale
    /\ ...

Pattern 3: Leader Election (Nondeterministic)

\* Abstract leader election — nondeterministic candidacy
BecomeLeader(n) ==
    /\ \A m \in Node : leader[m] /= n  \* not already leader
    /\ leader' = [leader EXCEPT ![n] = n]
    /\ currentEpoch' = [currentEpoch EXCEPT ![n] = currentEpoch[n] + 1]
    /\ UNCHANGED <<...>>

Do NOT model the election protocol (Raft, ZAB) unless you are verifying the election protocol itself. Abstract it as nondeterministic leader selection with appropriate constraints.

Pattern 4: Timeout (Always-Enabled Action, NO Clock)

\* BAD: modeling a clock for timeouts
Timeout(n) ==
    /\ clock[n] >= deadline[n]  \* NO — do not model clocks
    /\ ...

\* GOOD: timeout as an always-enabled action
Timeout(n) ==
    /\ state[n] = "waiting"
    /\ state' = [state EXCEPT ![n] = "timed_out"]
    /\ UNCHANGED <<...>>
\* Timeout can fire at ANY time a node is waiting — this captures
\* all possible timeout behaviors without modeling time.

[T1]: Never model real-time clocks in a TLA+ specification unless you are specifically verifying real-time properties. Timeouts are modeled as always-enabled actions that capture the nondeterminism of when they fire.

Pattern 5: Log Replication (Sequence with Truncate-Overwrite)

\* Log as sequence with append and truncate-overwrite
VARIABLE log  \* log \in [Node -> Seq(Entry)]

AppendEntry(n, entry) ==
    /\ log' = [log EXCEPT ![n] = Append(log[n], entry)]
    /\ UNCHANGED <<...>>

TruncateAndOverwrite(n, idx, entries) ==
    /\ log' = [log EXCEPT ![n] = SubSeq(log[n], 1, idx) \o entries]
    /\ UNCHANGED <<...>>

Pattern 6: Membership Change (Config as Log Entry)

\* Joint consensus: configuration transitions via log
VARIABLE config  \* config \in [Node -> SUBSET Node]

\* Configuration change is a special log entry
ProposeConfigChange(leader, newMembers) ==
    /\ AppendEntry(leader, [type |-> "config", members |-> newMembers])
    /\ config' = [config EXCEPT ![leader] = config[leader] \cup newMembers]
    /\ UNCHANGED <<...>>

Pattern 7: Idempotency (Accumulating Set)

\* Track processed operation IDs to ensure idempotency
VARIABLE processed  \* processed \in [Node -> SUBSET OpId]

ProcessOp(n, op) ==
    /\ op.id \notin processed[n]  \* skip if already processed
    /\ \* ... perform operation ...
    /\ processed' = [processed EXCEPT ![n] = processed[n] \cup {op.id}]
    /\ UNCHANGED <<...>>

Pattern 8: Two-Phase Protocol (Prepare-Collect-Decide)

\* Two-phase pattern: prepare → collect votes → decide
VARIABLE phase    \* [Node -> {"idle", "prepare", "decide"}]
VARIABLE votes    \* [Node -> SUBSET Node]

Prepare(n) ==
    /\ phase[n] = "idle"
    /\ phase' = [phase EXCEPT ![n] = "prepare"]
    /\ votes' = [votes EXCEPT ![n] = {}]
    /\ \* send prepare messages
    /\ UNCHANGED <<...>>

ReceiveVote(n, voter) ==
    /\ phase[n] = "prepare"
    /\ votes' = [votes EXCEPT ![n] = votes[n] \cup {voter}]
    /\ UNCHANGED <<...>>

Decide(n) ==
    /\ phase[n] = "prepare"
    /\ votes[n] \in Quorum  \* collected a quorum
    /\ phase' = [phase EXCEPT ![n] = "decide"]
    /\ UNCHANGED <<...>>

Pair with a refinement mapping to show this implements a simpler atomic specification. [T2]

Pattern 9: Failure and Restart (Durable vs Volatile State)

\* Failure: volatile state lost, durable state preserved
VARIABLE volatileState  \* lost on restart
VARIABLE durableState   \* preserved on restart

Crash(n) ==
    /\ volatileState' = [volatileState EXCEPT ![n] = InitVolatile]
    /\ UNCHANGED <<durableState, ...>>
    \* Durable state is UNCHANGED — this is the key invariant

Restart(n) ==
    /\ \* recover from durableState[n]
    /\ volatileState' = [volatileState EXCEPT
                           ![n] = RecoverFrom(durableState[n])]
    /\ UNCHANGED <<durableState, ...>>

Pattern 10: Ghost Variables (Append-Only History)

\* Ghost variable: records history for property checking
\* Does NOT affect system behavior — only used in invariants/properties
VARIABLE history  \* history \in Seq(Event)

\* Append to history in every action that is relevant
ActionWithHistory(n) ==
    /\ \* ... normal action logic ...
    /\ history' = Append(history, [type |-> "action", node |-> n, ...])
    /\ UNCHANGED <<...>>

\* Use ghost variable in properties
NoDoubleCommit ==
    \A i, j \in 1..Len(history) :
        /\ history[i].type = "commit"
        /\ history[j].type = "commit"
        /\ history[i].key = history[j].key
        => history[i].value = history[j].value

Ghost variables are specification-only artifacts. They are append-only (history never shrinks) and used solely to express properties that reference past states. Mark them clearly in comments.

Phase 3: Safety Verification

After establishing the specification structure (Phase 1) and modeling decisions (Phase 2), verify safety properties.

3.1 Safety Specification Checklist [T1, M9, M5]

SAFETY CHECKLIST
────────────────
[ ] TypeOK is defined and checked first
    → Every variable constrained to its domain [T1, M9]

[ ] All safety invariants expressed as state predicates
    → "P is always true" written as INVARIANT P in TLC config
    → For action properties, use [][ActionProp]_vars form

[ ] Deadlock-freedom checked between safety and liveness phases
    → TLC checks deadlock by default (all states have a Next step)
    → Intentional termination: add a Done action or use
      PROPERTY [][Next]_vars => <>[]done to distinguish
    → [P4.4.F6]: ALWAYS check deadlock-freedom after safety passes
      but BEFORE adding liveness. A deadlocked spec trivially
      satisfies some liveness properties (vacuous truth).

[ ] No CONSTRAINT in the TLC config for the final check
    → CONSTRAINT is acceptable during development to speed iteration
    → CONSTRAINT MUST be removed for the final safety verification
    → [T1, M11]: CONSTRAINT can hide states reachable via longer paths

[ ] Invariants are inductive (for TLAPS proofs)
    → An invariant I is inductive if: I /\ Next => I'
    → TypeOK is typically inductive; complex safety properties
      may need strengthening (adding auxiliary conjuncts)

3.2 Deadlock-Freedom Gate [T1, P4.4.F6]

This check is mandatory between the safety and liveness phases.

A specification deadlocks if some reachable state has no Next step enabled. TLC checks this by default. However:

If your protocol has intentional termination, you must distinguish it from unintentional deadlock:

\* Termination action — prevents false deadlock reports
Done ==
    /\ AllNodesFinished
    /\ UNCHANGED vars

Next == ActionA \/ ActionB \/ ... \/ Done

A deadlocked spec can vacuously satisfy liveness properties. If the system deadlocks in a state where the liveness property is already satisfied, TLC will report no error. This is not a real liveness guarantee.

DEADLOCK GATE
─────────────
[ ] TLC reports zero deadlock states
    OR
[ ] All "deadlock" states are intentional termination states
    AND they are documented in the spec
    AND a Done action is present

3.3 TLC Safety Configuration [T1, M11, M12]

Recommended TLC configuration for safety checking:

\* MyProtocol.cfg — safety checking configuration
SPECIFICATION Spec

\* Type invariant first, then safety properties
INVARIANT TypeOK
INVARIANT SafetyProperty1
INVARIANT SafetyProperty2

\* State constraint for development ONLY — remove for final check
\* CONSTRAINT StateConstraint

\* Symmetry sets — safe for safety properties [T1, M11]
SYMMETRY NodeSymmetry

\* Constants
CONSTANT Node = {n1, n2, n3}
CONSTANT MaxValue = 3

TLC Option	Safety Checking	Notes
`INVARIANT`	Sound	Always use for state predicates
`CONSTRAINT`	Sound but incomplete	May miss states reachable via longer paths; remove for final check
`SYMMETRY`	Sound if symmetry set is correct	Incorrect symmetry sets produce unsound results silently
`VIEW`	Sound	Reduces state space via abstraction
`ACTION_CONSTRAINT`	Sound	Restricts which actions TLC explores

Phase 4: Liveness Verification

Liveness properties ("something good eventually happens") require fairness assumptions and careful TLC configuration. This is where most soundness errors occur.

4.1 Six-Phase Liveness Methodology [T1, M3, P3.2, P3.3]

Follow these phases strictly in order.

Phase L1: Write safety first (complete Phase 3)

Never add liveness to a specification that has not passed safety checking. Liveness debugging on an incorrect safety spec is wasted effort.

Phase L2: Identify liveness requirements

Write down what MUST eventually happen in plain English:

Examples:
  - "Every request eventually receives a response"
  - "Every node eventually learns of every update"
  - "The leader eventually commits every proposed entry"
  - "The system eventually reaches a stable configuration"

Express each as a temporal formula:

\* "Every request eventually receives a response"
Liveness1 == \A r \in requests : <>(responded[r])

\* "The leader eventually commits" (conditional on leadership)
Liveness2 == []<>(committed = proposed)

\* Response: "always eventually" (repeated liveness)
Liveness3 == []<>(\E n \in Node : state[n] = "active")

Phase L3: Select fairness — WF first, SF only when needed [T2, M4, P3.3]

Use the fairness selection decision table:

Condition	Fairness	Rationale
Action is continuously enabled once enabled	`WF_vars(Action)`	WF suffices: if enabled forever, it eventually happens
Action is repeatedly enabled then disabled (intermittent)	`SF_vars(Action)`	SF needed: must happen if repeatedly enabled, even if intermittently disabled
Action represents environment/external input	No fairness	Never make the environment fair — it models adversarial behavior
Action is the composite `Next`	No fairness on `Next` alone	`WF_vars(Next)` says "some action happens" but does not guarantee WHICH action happens; useless for most properties
Timeout action	Usually no fairness	Timeouts model nondeterminism; fairness on timeouts constrains which timeouts fire

WF vs SF decision procedure:

Start with WF_vars(A) for every system action A
Check if TLC finds a liveness counterexample
If the counterexample shows Action A repeatedly enabled then disabled without executing, upgrade to SF_vars(A) for that specific action
Never start with SF — it is a stronger assumption and may hide real liveness bugs

[T2, M4]: Lamport's convention: "Weak fairness is almost always what you want." Strong fairness is needed only for actions that are intermittently disabled by other actions (e.g., a receive action that is enabled only when a message is in transit, and messages can be repeatedly consumed and re-sent).

Phase L4: Verify machine closure (automatic in canonical form)

In the canonical form, machine closure is guaranteed [T1, M1]. You only need to worry about machine closure if you deviate from canonical form (e.g., adding raw temporal formulas to the spec).

Phase L5: Model-check liveness [T1, P3.2, M11, M12]

THIS IS THE CRITICAL STEP. TLC's liveness checking has soundness traps that do not exist in safety checking.

LIVENESS MODEL-CHECKING RULES
──────────────────────────────
[ ] REMOVE all CONSTRAINT directives from the TLC config
    → CONSTRAINT + liveness = UNSOUND [T1, M11]
    → CONSTRAINT can hide infinite behaviors that violate liveness
    → This is NON-NEGOTIABLE for liveness checking

[ ] REMOVE all SYMMETRY directives from the TLC config
    → SYMMETRY + liveness = DANGEROUS [T1, M11]
    → Symmetry reduction may cause TLC to miss liveness violations
    → It is sometimes safe but the conditions are subtle;
      removing it is always safe

[ ] Use -lncheck final for production liveness verification
    → -lncheck final performs the complete liveness check
    → Without -lncheck, TLC may use an incomplete algorithm

[ ] Liveness properties go in PROPERTY (not INVARIANT)
    → INVARIANT is for state predicates (safety)
    → PROPERTY is for temporal formulas (liveness)

Phase L6: Debug liveness failures [T1, P3.2]

Lasso structure:
  s0 → s1 → s2 → ... → sk → ... → sj
                               ↑        |
                               └────────┘
                             (back-loop: infinite repetition)

Debugging procedure:
1. Read the finite prefix: how does the system reach state sk?
2. Read the back-loop: what actions repeat forever?
3. Check: is there an action that SHOULD fire but doesn't?
   → If yes: add or strengthen fairness for that action
4. Check: is the counterexample realistic?
   → Unrealistic: the environment acts adversarially in a way your
     system need not tolerate. Consider if your liveness requirement
     is too strong.
   → Realistic: the protocol has a liveness bug. Fix the protocol.
5. Check: does the back-loop involve only stutter steps?
   → If yes: your specification may be missing an action.

4.2 Complete TLC Soundness Matrix [T1, M11, M12]

This is the complete map of TLC configuration options versus soundness for safety and liveness checking. Reference this table before every TLC run.

Config Option	Safety	Liveness	Risk	Recommendation
INVARIANT	Sound	N/A (wrong directive)	None for safety	Use for state predicates
PROPERTY	N/A (wrong directive)	Sound (with caveats below)	None	Use for temporal formulas
CONSTRAINT	Sound (but incomplete)	UNSOUND	High: hides reachable states from liveness check	Remove for final liveness check; acceptable during development for safety only
SYMMETRY	Sound (if correct)	Dangerous	Medium: may miss counterexamples	Remove for liveness; use only for safety with verified symmetry set
VIEW	Sound	Unverified	Medium: hash collisions possible	Use with caution; verify key results without VIEW
ACTION_CONSTRAINT	Sound	Context-dependent	Medium: may prevent actions needed for liveness counterexample	Review carefully for liveness
-lncheck final	N/A	Required for soundness	High if omitted: may use incomplete algorithm	Always use for production liveness checks
-workers N (N>1)	Sound	Sound (TLC 2.18+)	Low: older TLC versions had bugs with multi-worker liveness	Use current TLC version

[T1, P3.2]: The safe liveness workflow: (1) pass all safety checks with CONSTRAINT/SYMMETRY, (2) remove CONSTRAINT and SYMMETRY, (3) add PROPERTY directives, (4) run with -lncheck final, (5) accept the longer runtime.

Phase 5: Model Checking Configuration

5.1 Small Model Heuristics [T2, M10, P4.4.F3]

AWS's experience (Newcombe et al., CACM 2015): 3-5 processes are sufficient for finding most design bugs. Start small, increase only when needed.

Model Size	When Sufficient	When Insufficient
2 nodes	Mutual exclusion, simple leader-follower	Quorum-based protocols (need 3+)
3 nodes	Most consensus protocols, quorum systems	Protocols with f+1 requirements where f>1
4-5 nodes	Complex fault-tolerance (f=1 with 2f+1)	Threshold-dependent bugs (e.g., f=2 requires 5+ nodes)
>5 nodes	Byzantine fault tolerance (3f+1)	Rarely needed for design verification

Caveat [P4.4.F3]: Small models miss:

Threshold-dependent bugs: Properties that only fail at specific quorum sizes (e.g., a bug that manifests only with exactly 2 of 5 nodes failing)

Symmetry-breaking bugs: Bugs that require specific asymmetric configurations

Scaling-related issues: Performance properties that degrade non-linearly

If your protocol has threshold-dependent behavior, test at least one configuration AT the threshold and one ABOVE it.

5.2 TLC Configuration Guide

Development configuration (fast iteration):

SPECIFICATION Spec
INVARIANT TypeOK
INVARIANT SafetyProperty
CONSTRAINT SmallStateConstraint   \* OK for safety development
SYMMETRY NodeSymmetry             \* OK for safety development
CONSTANT Node = {n1, n2, n3}     \* Start small

Final safety configuration (production):

SPECIFICATION Spec
INVARIANT TypeOK
INVARIANT SafetyProperty
\* NO CONSTRAINT — full state space
SYMMETRY NodeSymmetry             \* OK for safety (if correct)
CONSTANT Node = {n1, n2, n3}

Final liveness configuration (production):

SPECIFICATION Spec
INVARIANT TypeOK                  \* Keep safety checks active
PROPERTY LivenessProperty
\* NO CONSTRAINT — UNSOUND with liveness
\* NO SYMMETRY — DANGEROUS with liveness
CONSTANT Node = {n1, n2, n3}

TLC command-line flags:

Flag	Purpose	When to Use
`-workers auto`	Use all CPU cores	Always (production runs)
`-lncheck final`	Complete liveness check	Always for liveness verification
`-deadlock`	Disable deadlock checking	Only if spec has intentional terminal states AND Done action is present
`-depth N`	Limit search depth	Development only; never for final verification
`-coverage N`	Print coverage stats every N minutes	Development; helps find unreachable actions
`-dump format file`	Dump state graph	Debugging counterexamples
`-simulate num=N`	Random simulation instead of full BFS	Very large state spaces where exhaustive check is infeasible

5.3 Multi-Tool Awareness [T3, P4.3.F6, P4.4.F2]

TLC is not the only TLA+ verification tool. Be aware of alternatives and their strengths.

Tool	Technique	Strengths	Limitations
TLC	Explicit-state model checking	Complete for finite models; well-understood soundness; fast for small models	Exponential state explosion; finite models only
Apalache	Symbolic model checking (SMT)	Handles larger state spaces; bounded model checking for unbounded domains; type checking	Slower per-state; less mature than TLC; limited liveness support
TLAPS	Interactive theorem proving	Full mathematical proofs; unbounded verification; strongest guarantee	High manual effort; steep learning curve
Quint	Emerging: TLA+-like language with better tooling	Modern syntax; integrated testing; simulation	Not yet production-proven; subset of TLA+ expressiveness

TLC-specific advice is marked throughout this skill. Where behavior differs in Apalache, notes are included. However, this skill primarily targets TLC workflows. For Apalache-specific guidance, consult the Apalache documentation.

Apalache type annotations: Apalache requires type annotations that TLC ignores. If you plan to use both tools, add Apalache type annotations as comments:
\* @type: Set(NODE) => Bool;
IsQuorum(S) == Cardinality(S) * 2 > Cardinality(Node)

Phase 6: PlusCal Considerations [T3, M16, P4.4.F1]

PlusCal is a higher-level algorithmic language that compiles to TLA+. It is a valid choice for certain specifications.

6.1 PlusCal Decision Table

Factor	PlusCal	Raw TLA+
Author familiarity	Imperative/sequential programming	Set theory, temporal logic
Specification type	Sequential or simple concurrent algorithms	Complex concurrent/distributed protocols
Expressiveness ceiling	Limited: sequential composition, limited nondeterminism	Full: arbitrary temporal logic, flexible composition
Compilation	Compiles to TLA+ (can inspect/modify output)	Direct
Fairness	Automatic per-process fairness	Manual fairness specification
Best for	Teaching, prototyping, algorithm verification	Protocol verification, refinement, industrial use

[T3, P4.4.F1]: A finished PlusCal specification is better than an unfinished TLA+ specification. If PlusCal gets you to a checkable spec faster, use it. But know the ceiling: PlusCal cannot express all TLA+ patterns (e.g., arbitrary interleaving of non-process actions, complex fairness, stuttering-sensitive refinement).

When to switch from PlusCal to TLA+:

You need fairness on individual actions (not just processes)

You need to compose specifications via refinement

You need ghost variables that participate in the temporal formula

PlusCal's process model does not match your system's concurrency model

Quick Reference

Common mistakes catalog, specification template, decision tables, evidence traceability matrix, and references are in references/quick-reference.md.

tla-spec

同仓库更多 Skills

同仓库更多 Skills

TLA+ Specification Correctness Guide

Evidence Base

Confidence Tiers

When to Use

When NOT to Use

Phase 1: Specification Foundations

1.1 Canonical Form [T1, M1, P4.1.F1]

1.2 Stuttering Invariance Checklist [T1, M2, M5]

1.3 TypeOK as First Invariant [T1, M9]

Phase 2: Abstraction and Modeling

2.1 Seven-Step Abstraction Framework [T2, P3.1, M8]

2.2 Ten Distributed Systems Patterns [T2, P3.4, M14]

Pattern 1: Quorum

Pattern 2: Epoch / Term (Monotonic Counter)

Pattern 3: Leader Election (Nondeterministic)

Pattern 4: Timeout (Always-Enabled Action, NO Clock)

Pattern 5: Log Replication (Sequence with Truncate-Overwrite)

Pattern 6: Membership Change (Config as Log Entry)

Pattern 7: Idempotency (Accumulating Set)

Pattern 8: Two-Phase Protocol (Prepare-Collect-Decide)

Pattern 9: Failure and Restart (Durable vs Volatile State)

Pattern 10: Ghost Variables (Append-Only History)

Phase 3: Safety Verification

3.1 Safety Specification Checklist [T1, M9, M5]

3.2 Deadlock-Freedom Gate [T1, P4.4.F6]

3.3 TLC Safety Configuration [T1, M11, M12]

Phase 4: Liveness Verification

4.1 Six-Phase Liveness Methodology [T1, M3, P3.2, P3.3]

4.2 Complete TLC Soundness Matrix [T1, M11, M12]

Phase 5: Model Checking Configuration

5.1 Small Model Heuristics [T2, M10, P4.4.F3]

5.2 TLC Configuration Guide

5.3 Multi-Tool Awareness [T3, P4.3.F6, P4.4.F2]

Phase 6: PlusCal Considerations [T3, M16, P4.4.F1]

6.1 PlusCal Decision Table

Quick Reference

TLA+ Specification Correctness Guide

Evidence Base

Confidence Tiers

When to Use

When NOT to Use

Phase 1: Specification Foundations

1.1 Canonical Form [T1, M1, P4.1.F1]

1.2 Stuttering Invariance Checklist [T1, M2, M5]

1.3 TypeOK as First Invariant [T1, M9]

Phase 2: Abstraction and Modeling

2.1 Seven-Step Abstraction Framework [T2, P3.1, M8]

2.2 Ten Distributed Systems Patterns [T2, P3.4, M14]

Pattern 1: Quorum

Pattern 2: Epoch / Term (Monotonic Counter)

Pattern 3: Leader Election (Nondeterministic)

Pattern 4: Timeout (Always-Enabled Action, NO Clock)

Pattern 5: Log Replication (Sequence with Truncate-Overwrite)

Pattern 6: Membership Change (Config as Log Entry)

Pattern 7: Idempotency (Accumulating Set)

Pattern 8: Two-Phase Protocol (Prepare-Collect-Decide)

Pattern 9: Failure and Restart (Durable vs Volatile State)

Pattern 10: Ghost Variables (Append-Only History)

Phase 3: Safety Verification

3.1 Safety Specification Checklist [T1, M9, M5]

3.2 Deadlock-Freedom Gate [T1, P4.4.F6]

3.3 TLC Safety Configuration [T1, M11, M12]

Phase 4: Liveness Verification

4.1 Six-Phase Liveness Methodology [T1, M3, P3.2, P3.3]

4.2 Complete TLC Soundness Matrix [T1, M11, M12]

Phase 5: Model Checking Configuration

5.1 Small Model Heuristics [T2, M10, P4.4.F3]

5.2 TLC Configuration Guide

5.3 Multi-Tool Awareness [T3, P4.3.F6, P4.4.F2]

Phase 6: PlusCal Considerations [T3, M16, P4.4.F1]

6.1 PlusCal Decision Table

Quick Reference