rasa-configuring-model-groups

Name: Rasa Configuring Model Groups
Author: RasaHQ

// Configures model_groups in endpoints.yml for LLM and embedding providers. Covers single and multi-deployment setups, routing strategies, failover, self-hosted models, and caching. Use when adding or changing LLM providers or setting up multi-LLM routing.

updated:March 3, 2026 at 12:39

"author": "RasaHQ"

"repository": "RasaHQ/rasa-cursor-plugin"

name	rasa-configuring-model-groups
description	Configures model_groups in endpoints.yml for LLM and embedding providers. Covers single and multi-deployment setups, routing strategies, failover, self-hosted models, and caching. Use when adding or changing LLM providers or setting up multi-LLM routing.
license	Apache-2.0
metadata	{"author":"rasa","version":"3.x","docs-url":"https://rasa.com/docs/pro/deploy/llm-routing"}

Configuring Model Groups

Model groups are defined in endpoints.yml under the model_groups key. Pipeline components, the rephraser, and other features reference groups by their id. Each group contains one or more model deployments and an optional router for multi-deployment routing.

Workflow

1. Open endpoints.yml (create it if it doesn't exist).
2. Add a model group for the LLM used by the pipeline's command generator.
3. Add a model group for embeddings if flow retrieval is enabled.
4. If multiple deployments are needed, add them to the same group and configure a routing strategy (see "Configuring multi-deployment routing").
5. Reference the group id from the pipeline components in config.yml (see the rasa-configuring-assistant skill).
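
As a rough sketch of the last step, a config.yml pipeline might reference the groups as shown below. The component name CompactLLMCommandGenerator and the group ids openai_llm and openai_embeddings are assumptions chosen for illustration; check the exact keys against the Rasa documentation for your version.

```yaml
# config.yml (illustrative sketch, not a complete pipeline)
pipeline:
  - name: CompactLLMCommandGenerator   # assumed command generator component
    llm:
      model_group: openai_llm          # id of an LLM group in endpoints.yml
    flow_retrieval:
      embeddings:
        model_group: openai_embeddings # id of an embeddings group in endpoints.yml
```

Because components point at a group id rather than a concrete model, the provider or deployment can later be swapped in endpoints.yml without editing config.yml.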

Providers

Rasa provides dedicated client wrappers only for certain providers. The supported sets differ for LLM and embeddings.

LLM (Rasa wrappers):

openai,
azure,
self-hosted,
rasa.

Embeddings (Rasa wrappers):

openai,
azure,
huggingface_local.

For any other provider (e.g. Anthropic, Cohere, Google), use the provider keys and options from LiteLLM's provider list, since Rasa's generic clients are built on LiteLLM.
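
A minimal sketch of such a group, assuming Anthropic as the provider: the provider key anthropic comes from LiteLLM's provider list, while the group id and model name here are illustrative assumptions rather than values from the Rasa documentation.

```yaml
model_groups:
  - id: anthropic_llm                      # hypothetical group id
    models:
      - provider: anthropic                # LiteLLM provider key
        model: claude-3-5-sonnet-20240620  # example model name; substitute your own
        # Assumption: the API key is supplied through the environment variable
        # that LiteLLM reads for this provider (ANTHROPIC_API_KEY).
```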

Configuring a single provider

The simplest setup uses one deployment per group:

model_groups:
  - id: openai_llm
    models:
      - provider: openai
        model: gpt-4o-2024-11-20

  - id: openai_embeddings
    models:
      - provider: openai
        model: text-embedding-3-large

To switch providers, change provider and add any required provider-specific settings:

model_groups:
  - id: azure_llm
    models:
      - provider: azure
        deployment: rasa-gpt-4
        api_base: https://my-azure-instance/
        api_version: "2024-02-15-preview"
        api_key: ${AZURE_API_KEY}

Configuring multi-deployment routing

Place multiple deployments in the same group for load balancing, failover, or latency optimization. Add a router block to control distribution.

Keep all deployments in a group on the same underlying model. Mixing fundamentally different models (e.g. GPT-3.5 and GPT-4) in one group leads to unpredictable behavior, because consecutive requests may be answered by different models. Router settings are per group and independent: each group can use a different strategy.

model_groups:
  - id: azure_llm
    models:
      - provider: azure
        deployment: gpt-4-france
        api_base: https://azure-france/
        api_version: "2024-02-15-preview"
        api_key: ${AZURE_KEY_FRANCE}
      - provider: azure
        deployment: gpt-4-canada
        api_base: https://azure-canada/
        api_version: "2024-02-15-preview"
        api_key: ${AZURE_KEY_CANADA}
    router:
      routing_strategy: least-busy

Routing strategies

Strategy	Description
`simple-shuffle`	Distributes based on RPM (requests per minute) or weight
`least-busy`	Routes to deployment with fewest ongoing requests
`latency-based-routing`	Routes to lowest-latency deployment
`cost-based-routing`	Routes to lowest-cost deployment (requires Redis)
`usage-based-routing`	Routes to lowest-usage deployment (requires Redis)

Router customization

Fine-tune failover behavior with these optional parameters:

router:
  routing_strategy: least-busy
  cooldown_time: 10        # seconds before retrying a failed deployment
  allowed_fails: 2         # failures before marking deployment unavailable
  num_retries: 3           # retries per failed request

Refer to LiteLLM's routing configuration documentation for details on these parameters.

Redis for cost/usage routing

Cost- and usage-based strategies track token usage over time and require a Redis backend:

router:
  routing_strategy: cost-based-routing
  redis_host: localhost
  redis_port: 6379
  redis_password: ${REDIS_PASSWORD}

Or via URL:

router:
  routing_strategy: usage-based-routing
  redis_url: "redis://:${REDIS_PASSWORD}@host:6379"

Caching

Enable response caching to reduce load and cost. For production, back it with Redis (in-memory caching does not persist across restarts):

router:
  routing_strategy: simple-shuffle
  cache_responses: true
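
For a Redis-backed cache, one plausible configuration combines cache_responses with the Redis connection keys shown in the cost/usage routing section. This is a sketch under the assumption that the router reuses those keys for caching; confirm it against the Rasa and LiteLLM documentation before relying on it.

```yaml
router:
  routing_strategy: simple-shuffle
  cache_responses: true
  # Assumption: the same Redis connection keys used for cost/usage routing
  # also back the response cache.
  redis_host: localhost
  redis_port: 6379
  redis_password: ${REDIS_PASSWORD}
```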

Embeddings routing

The same routing configuration (strategies, Redis, caching) works for embedding model groups — just use an embeddings provider in the models list.
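
As an illustration, an embeddings group with two Azure deployments could mirror the LLM example above. The group id and deployment names are hypothetical; the remaining keys are taken from the Azure LLM example in this document.

```yaml
model_groups:
  - id: azure_embeddings                            # hypothetical group id
    models:
      - provider: azure
        deployment: text-embedding-3-large-france   # hypothetical deployment name
        api_base: https://azure-france/
        api_version: "2024-02-15-preview"
        api_key: ${AZURE_KEY_FRANCE}
      - provider: azure
        deployment: text-embedding-3-large-canada   # hypothetical deployment name
        api_base: https://azure-canada/
        api_version: "2024-02-15-preview"
        api_key: ${AZURE_KEY_CANADA}
    router:
      routing_strategy: simple-shuffle
```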

Self-hosted models

Use provider: self-hosted for vLLM and Llama.cpp, or provider: ollama for Ollama. Multiple instances can be routed just like cloud deployments.

When routing is enabled for self-hosted models, use_chat_completions_endpoint must be set at the router level, not on individual models.

# vLLM
model_groups:
  - id: vllm_llm
    models:
      - provider: self-hosted
        model: meta-llama/Meta-Llama-3-8B
        api_base: "http://localhost:8000/v1"
      - provider: self-hosted
        model: meta-llama/Meta-Llama-3-8B
        api_base: "http://localhost:8001/v1"
    router:
      routing_strategy: least-busy
      use_chat_completions_endpoint: false   # router level, not model level

# Ollama (separate example; in a single endpoints.yml, list every group under one model_groups key)
model_groups:
  - id: ollama_llm
    models:
      - provider: ollama
        model: llama3.1
        api_base: "http://localhost:11434"

LiteLLM proxy

Route through a LiteLLM proxy server using provider: litellm_proxy:

model_groups:
  - id: litellm_proxy_llm
    models:
      - provider: litellm_proxy
        model: gpt-4-instance-1
        api_base: "http://localhost:4000"
      - provider: litellm_proxy
        model: gpt-4-instance-2
        api_base: "http://localhost:4000"
    router:
      routing_strategy: least-busy
