aqua-evaluation

Name: Aqua Evaluation
Author: oracle

Evaluate LLM model quality using BERTScore, ROUGE, Perplexity, and Text Readability metrics on OCI AI Quick Actions (AQUA). Covers dataset preparation, evaluation job creation, and report interpretation. Triggered when user wants to evaluate or benchmark a model.


stars:126

forks:65

updated:February 28, 2026, 16:13

File explorer

3 files

SKILL.md (read-only)

related-skills.json (same repository)

aqua-cli.md, from "oracle/accelerated-data-science"

Complete CLI reference for the ADS AQUA command-line interface (ads aqua). Covers all model, deployment, evaluation, and fine-tuning commands with full parameter documentation. Triggered when user asks about CLI commands, wants to run AQUA operations from terminal, or needs command syntax.

aqua-deployment.md, from "oracle/accelerated-data-science"

Deploy LLM models on OCI using AI Quick Actions (AQUA): single model, multi-model, stacked (LoRA), with GPU shape selection, vLLM configuration, streaming, and tool calling. Triggered when user wants to deploy, update, or manage model deployments.

aqua-finetuning.md, from "oracle/accelerated-data-science"

Fine-tune LLM models using LoRA on OCI AI Quick Actions (AQUA). Covers dataset preparation (instruction, conversational, multimodal, tokenized formats), hyperparameter tuning, distributed training, and training metrics. Triggered when user wants to fine-tune or customize a model.

aqua-metrics.md, from "oracle/accelerated-data-science"

Set up Prometheus and Grafana monitoring for AQUA vLLM model deployments on OCI. Covers the signing proxy, container registry setup, OCI Container Instance deployment, and PromQL dashboards. Triggered when user wants to monitor LLM deployments, view TTFT/latency/throughput metrics, or set up observability for AQUA.

aqua-model-lifecycle.md, from "oracle/accelerated-data-science"

Register, list, get, and manage LLM models in OCI AI Quick Actions (AQUA) using the ADS SDK. Triggered when user wants to import models from HuggingFace or Object Storage, browse available models, or manage model catalog entries.

aqua-troubleshooting.md, from "oracle/accelerated-data-science"

Diagnose and fix OCI AI Quick Actions (AQUA) issues including deployment failures, OOM errors, authorization problems, capacity issues, container errors, and policy misconfigurations. Triggered when user encounters errors or needs help debugging AQUA workflows.

package.json

"author": "oracle"

"repository": "oracle/accelerated-data-science"


name	aqua-evaluation
description	Evaluate LLM model quality using BERTScore, ROUGE, Perplexity, and Text Readability metrics on OCI AI Quick Actions (AQUA). Covers dataset preparation, evaluation job creation, and report interpretation. Triggered when user wants to evaluate or benchmark a model.
user-invocable	true
disable-model-invocation	false

AQUA Model Evaluation

Use this skill when the user wants to evaluate LLM models on OCI Data Science using AI Quick Actions.

Supported Metrics

Metric	Description	Best For
BERTScore	Embedding-based semantic similarity (precision, recall, F1)	General text quality, aligns well with human judgement
ROUGE	N-gram overlap between generated and reference text	Summarization tasks
Perplexity	How well the model predicts the text	Language modeling quality
Text Readability	Reading level / complexity of generated text	Content accessibility
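
To make the Perplexity row concrete: perplexity is conventionally the exponential of the average negative log-likelihood per token, so lower values mean the model predicts the text better. The sketch below assumes you already have per-token natural-log probabilities; the function name and sample values are hypothetical, and this is not necessarily how AQUA computes the metric internally.

```python
import math


def perplexity(token_logprobs):
    """Return exp of the mean negative log-probability (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every token has perplexity 4:
# it is as uncertain as a uniform choice among four tokens.
uniform = [math.log(0.25)] * 8
print(round(perplexity(uniform), 6))
```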

Dataset Format

The dataset is a JSONL file: one JSON object per line, with required prompt and completion keys and an optional category key:

{"prompt": "Summarize this dialog:\nAmanda: I baked cookies...", "completion": "Amanda baked cookies and will bring some for Jerry tomorrow."}
{"prompt": "Translate to French: Hello world", "completion": "Bonjour le monde", "category": "translation"}
{"prompt": "What is 2+2?", "completion": "4", "category": "math"}

The category field breaks down evaluation metrics by category in the report (e.g., accuracy per category). When omitted, it defaults to "_" (unknown).
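
Before uploading a dataset, it can help to check it locally against the key rules above. The following is a minimal sketch, not part of ADS: the function name is hypothetical, and treating the three fields as strings and skipping blank lines are assumptions rather than documented AQUA behavior.

```python
import json


def validate_jsonl_lines(lines):
    """Return a list of (line_number, problem) tuples; empty means valid."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text:
            continue  # assumption: blank lines are ignored
        try:
            record = json.loads(text)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        # prompt and completion are required; category is optional.
        for key in ("prompt", "completion"):
            if not isinstance(record.get(key), str):
                problems.append((number, f"missing or non-string '{key}'"))
        if "category" in record and not isinstance(record["category"], str):
            problems.append((number, "'category' must be a string"))
    return problems


sample = [
    '{"prompt": "What is 2+2?", "completion": "4", "category": "math"}',
    '{"prompt": "Translate to French: Hello world"}',
]
print(validate_jsonl_lines(sample))
```

For a real file, pass an open file handle (for example `validate_jsonl_lines(open("eval_data.jsonl", encoding="utf-8"))`), since iterating a text file yields its lines.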

Official sample datasets (10 prompts each, math + logic categories):

File	Use Case
`examples/evaluation-sample-no-sys-message.jsonl`	Llama-style prompts without system message
`examples/evaluation-sample-with-sys-message.jsonl`	Llama-style prompts with `<<SYS>>` system message

Python SDK Usage

Import

from ads.aqua.evaluation import AquaEvaluationApp
eval_app = AquaEvaluationApp()

Create Evaluation

from ads.aqua.evaluation.entities import CreateAquaEvaluationDetails

details = CreateAquaEvaluationDetails(
    evaluation_source_id="ocid1.datasciencemodeldeployment.oc1.iad.xxx",  # Deployment OCID
    evaluation_name="llama-3.1-8b-eval-bertscore",
    dataset_path="oci://my-bucket@my-namespace/datasets/eval_data.jsonl",
    report_path="oci://my-bucket@my-namespace/eval-reports/",
    model_parameters={
        "max_tokens": 500,
        "temperature": 0.7,
        "top_p": 0.9,
    },
    shape_name="VM.Standard.E4.Flex",
    block_storage_size=50,
    compartment_id="ocid1.compartment.oc1..xxx",
    project_id="ocid1.datascienceproject.oc1.iad.xxx",
    log_group_id="ocid1.loggroup.oc1.iad.xxx",
    log_id="ocid1.log.oc1.iad.xxx",
    metrics=[
        {"name": "bertscore"},
        {"name": "rouge"},
    ],
)
evaluation = eval_app.create(create_evaluation_details=details)
print(f"Evaluation: {evaluation.id} | State: {evaluation.lifecycle_state}")

Evaluate a Model (not deployment) Directly

You can pass a model OCID instead of a deployment OCID:

details = CreateAquaEvaluationDetails(
    evaluation_source_id="ocid1.datasciencemodel.oc1.iad.xxx",  # Model OCID
    # ... rest of params same as above
)

Evaluate Stacked/Multi-Model Deployment

For stacked or multi-model deployments, specify which model to evaluate:

details = CreateAquaEvaluationDetails(
    evaluation_source_id="ocid1.datasciencemodeldeployment.oc1.iad.xxx",
    model_parameters={
        "max_tokens": 500,
        "temperature": 0.7,
        "model": "llama-3.1-8b-customer-support",  # Target specific model in deployment
    },
    # ... rest of params
)

With Experiment Tracking

details = CreateAquaEvaluationDetails(
    evaluation_source_id="ocid1.datasciencemodeldeployment.oc1.iad.xxx",
    evaluation_name="llama-eval-v2",
    dataset_path="oci://my-bucket@my-namespace/datasets/eval_data.jsonl",
    report_path="oci://my-bucket@my-namespace/eval-reports/",
    model_parameters={"max_tokens": 500, "temperature": 0.7},
    shape_name="VM.Standard.E4.Flex",
    block_storage_size=50,
    experiment_name="llama-evaluations",  # Groups evaluations together
    experiment_description="Llama 3.1 evaluation experiments",
    metrics=[{"name": "bertscore"}, {"name": "rouge"}],
)

List Evaluations

evaluations = eval_app.list(compartment_id="ocid1.compartment.oc1..xxx")
for e in evaluations:
    print(f"{e.display_name} | {e.lifecycle_state}")

Get Evaluation Details

evaluation = eval_app.get(eval_id="ocid1.datasciencemodel.oc1.iad.xxx")
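
Evaluations run as jobs, so scripts typically poll until the lifecycle state settles. Here is a library-independent sketch of such a loop: `wait_for_state` is a hypothetical helper, the state names are placeholders, and the real terminal states should be taken from the ADS documentation.

```python
import time


def wait_for_state(fetch_state, terminal_states, interval_s=30.0, timeout_s=3600.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_state() until it returns a terminal state or time runs out."""
    deadline = clock() + timeout_s
    while True:
        state = fetch_state()
        if state in terminal_states:
            return state
        if clock() >= deadline:
            raise TimeoutError(f"still in state {state!r} after {timeout_s} seconds")
        sleep(interval_s)


# Stand-in for a real lookup: replays a scripted sequence of states.
states = iter(["ACCEPTED", "IN_PROGRESS", "SUCCEEDED"])
final = wait_for_state(lambda: next(states), {"SUCCEEDED", "FAILED"},
                       sleep=lambda _: None)
print(final)
```

With ADS, `fetch_state` could be something like `lambda: eval_app.get(eval_id=...).lifecycle_state`, assuming the object returned by `get` exposes `lifecycle_state` the way the `create` result does.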

CLI Usage

Create Evaluation

ads aqua evaluation create \
  --evaluation_source_id "ocid1.datasciencemodeldeployment.oc1.iad.xxx" \
  --evaluation_name "llama-eval-bertscore" \
  --dataset_path "oci://my-bucket@my-namespace/datasets/eval_data.jsonl" \
  --report_path "oci://my-bucket@my-namespace/eval-reports/" \
  --model_parameters '{"max_tokens": 500, "temperature": 0.7}' \
  --shape_name "VM.Standard.E4.Flex" \
  --block_storage_size 50 \
  --compartment_id "ocid1.compartment.oc1..xxx" \
  --project_id "ocid1.datascienceproject.oc1.iad.xxx" \
  --metrics '[{"name": "bertscore"}, {"name": "rouge"}]'
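
The JSON-valued flags (`--model_parameters`, `--metrics`) are easy to misquote when typed by hand. One option, sketched below, is to build the command string in Python so that `json.dumps` produces the JSON and `shlex.join` handles shell quoting. `build_eval_command` is a hypothetical helper, not part of the ADS CLI, and it uses only flags shown in the command above.

```python
import json
import shlex


def build_eval_command(options):
    """Render `ads aqua evaluation create` with safely quoted option values."""
    parts = ["ads", "aqua", "evaluation", "create"]
    for name, value in options.items():
        # Dicts and lists become JSON text; everything else is passed as a string.
        text = json.dumps(value) if isinstance(value, (dict, list)) else str(value)
        parts.extend([f"--{name}", text])
    return shlex.join(parts)  # requires Python 3.8+


command = build_eval_command({
    "evaluation_name": "llama-eval-bertscore",
    "model_parameters": {"max_tokens": 500, "temperature": 0.7},
    "block_storage_size": 50,
    "metrics": [{"name": "bertscore"}, {"name": "rouge"}],
})
print(command)
```

The printed string can be pasted into a POSIX shell, or the unjoined argument list can be handed to `subprocess.run` directly, which avoids shell quoting altogether.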

List / Get Evaluations

ads aqua evaluation list --compartment_id "ocid1.compartment.oc1..xxx"
ads aqua evaluation get --eval_id "ocid1.datasciencemodel.oc1.iad.xxx"

Interpreting Results

BERTScore

The evaluation produces a Model Catalog entry with:

Precision: How much of the generated text is semantically represented in the reference
Recall: How much of the reference text is captured by the generated text
F1: Harmonic mean of precision and recall

Higher scores indicate better quality. Scores clustered tightly around the mean indicate consistent performance across prompts.
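
The three BERTScore numbers come from greedy matching of token embeddings: each generated token is paired with its most similar reference token (precision), and each reference token with its most similar generated token (recall). The toy sketch below uses hand-written 2-D vectors in place of real contextual embeddings and omits optional steps such as IDF weighting and baseline rescaling; the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_match_scores(candidate_vecs, reference_vecs):
    """Return (precision, recall, f1) from greedy cosine matching."""
    # Precision: average best match of each generated token in the reference.
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    # Recall: average best match of each reference token in the generated text.
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Two generated tokens, one reference token: the extra generated token
# has no counterpart, so precision drops while recall stays perfect.
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0]]
p, r, f1 = greedy_match_scores(candidate, reference)
print(round(p, 4), round(r, 4), round(f1, 4))
```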

ROUGE

ROUGE-1: Unigram overlap
ROUGE-2: Bigram overlap
ROUGE-L: Longest common subsequence
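
As a worked illustration of the three variants, the sketch below computes precision, recall, and F1 for ROUGE-1, ROUGE-2, and ROUGE-L on whitespace-split tokens. It is a simplified stand-in: production ROUGE implementations add their own tokenization and optional stemming, and which score AQUA reports is not specified here. All function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def prf(overlap, candidate_total, reference_total):
    """Turn an overlap count into (precision, recall, f1)."""
    precision = overlap / candidate_total if candidate_total else 0.0
    recall = overlap / reference_total if reference_total else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def rouge_n(candidate, reference, n):
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    return prf(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, one DP row at a time."""
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(candidate, reference):
    return prf(lcs_length(candidate, reference), len(candidate), len(reference))


candidate = "amanda baked cookies for jerry".split()
reference = "amanda baked cookies and will bring some for jerry tomorrow".split()
for label, scores in [
    ("ROUGE-1", rouge_n(candidate, reference, 1)),
    ("ROUGE-2", rouge_n(candidate, reference, 2)),
    ("ROUGE-L", rouge_l(candidate, reference)),
]:
    print(label, [round(s, 4) for s in scores])
```

In this example the short summary has perfect unigram precision but only half the recall, which is why summarization comparisons usually look at recall or F1 rather than precision alone.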

BERTScore Limitations

May favor models mirroring its own architecture
Lacks consideration for sentence-level syntax
Diminished effectiveness for context beyond word-level (idioms, cultural nuances)
Not suitable for evaluating coding models on programming tasks

Key Source Files

ads/aqua/evaluation/evaluation.py — AquaEvaluationApp (create, list, get, load_metrics)
ads/aqua/evaluation/entities.py — CreateAquaEvaluationDetails, AquaEvalMetrics
ads/aqua/config/evaluation/evaluation_service_config.py — Metric configuration
