ai-parsing-data

Updated June 13, 2026 at 13:41

Pull structured data from messy text using AI. Use when parsing invoices, extracting fields from emails, scraping entities from articles, converting unstructured text to JSON, extracting contact info, parsing resumes, reading forms, pulling data from transcripts (VTT, LiveKit, Recall), extracting fields from Langfuse traces, or any task where messy text goes in and clean structured data comes out. Also use when emails are messy and lack structure, or when structured data extraction from unstructured content is unreliable. Also use for: extract entities from text, parse PDF with AI, structured extraction from unstructured text, OCR plus AI extraction, convert email to structured data, pull fields from documents automatically, AI data entry automation, invoice parsing, resume parsing with AI, medical record extraction.

Source: lebsral/DSPy-Programming-not-prompting-LMs-skills

Build an AI Data Parser

Guide the user through building AI that pulls structured data out of messy text. Uses DSPy extraction — define the output shape, and the AI fills it in.

Step 1: Define what to extract

Ask the user:

What are you parsing? (emails, invoices, resumes, transcripts, articles, forms, etc.)
What fields do you need? (names, dates, amounts, entities, etc.)
Are any fields optional? (some documents might not have every field)
What's the output format? (flat fields, list of objects, nested structure)
Do you have examples of correct extractions? (even a few help with optimization)

Step 2: Build the parser

Simple field extraction

For pulling a known set of fields from text:

import dspy

# Configure any LM provider
lm = dspy.LM("openai/gpt-4o-mini")  # or "anthropic/claude-sonnet-4-5-20250929", etc.
dspy.configure(lm=lm)

class ParseContact(dspy.Signature):
    """Extract contact information from the text."""
    text: str = dspy.InputField(desc="Text containing contact information")
    name: str = dspy.OutputField(desc="Person's full name")
    email: str = dspy.OutputField(desc="Email address")
    phone: str = dspy.OutputField(desc="Phone number")

parser = dspy.ChainOfThought(ParseContact)

ChainOfThought adds reasoning before extraction, which helps the model think through which text maps to which field — typically 5-15% more accurate than bare Predict on ambiguous inputs.

Structured output with Pydantic

For complex or nested output, use Pydantic models. DSPy handles the serialization automatically:

from pydantic import BaseModel, Field
from typing import Optional

class Address(BaseModel):
    street: str
    city: str
    state: str
    zip_code: str

class Person(BaseModel):
    name: str
    age: Optional[int] = None
    email: Optional[str] = None
    address: Address
    skills: list[str]

class ParsePerson(dspy.Signature):
    """Extract person details from the text."""
    text: str = dspy.InputField()
    person: Person = dspy.OutputField()

parser = dspy.ChainOfThought(ParsePerson)
result = parser(text="John Doe, 32, lives at 123 Main St, Springfield IL 62701. Expert in Python and SQL.")
print(result.person)  # Person(name='John Doe', age=32, ...)

Use Optional for fields that might not appear in every document — this tells the model it's OK to return None instead of guessing.

Output format for small models

Small models (<4B params) produce frequent JSON syntax errors: unclosed braces, missing quotes, trailing commas. Switching to YAML output avoids these bracket-and-quote failures while preserving structured data, although YAML can still come back malformed, so parse it defensively. In one production case (3.6M historical name records), frontier models achieved ~70% accuracy while fine-tuned 0.8B-4B models using YAML output hit 94-96%; note that this comparison reflects fine-tuning as well as the output format.

import yaml

class ParsePersonYAML(dspy.Signature):
    """Extract person details from the text. Return the result as YAML, not JSON."""
    text: str = dspy.InputField()
    person_yaml: str = dspy.OutputField(desc="extracted person data in YAML format")

parser = dspy.Predict(ParsePersonYAML)
result = parser(text="John Doe, 32, john@example.com")

# Parse YAML back into structured data
person_data = yaml.safe_load(result.person_yaml)

Use this pattern when running sub-4B parameter models (Qwen, Phi, Gemma) locally. Larger models (GPT-4o, Claude) handle JSON fine — stick with Pydantic output fields for those.
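
A model can wrap its answer in a Markdown code fence or add a short preamble, either of which can trip yaml.safe_load. The following is a minimal sketch of a cleanup step to run before parsing. strip_code_fence is a hypothetical helper, not part of DSPy or PyYAML, and it assumes at most one fenced block in the output.

```python
FENCE = "`" * 3  # built this way so the example can sit inside a fenced block

def strip_code_fence(raw: str) -> str:
    """Return the payload of a fenced block if one is present, else the stripped input.

    Hypothetical helper: tolerates an optional language tag after the opening fence.
    """
    text = raw.strip()
    start = text.find(FENCE)
    if start == -1:
        return text
    # Skip the opening fence line (fence plus optional tag such as "yaml")
    newline = text.find("\n", start)
    if newline == -1:
        return text
    end = text.find(FENCE, newline)
    body = text[newline + 1:end] if end != -1 else text[newline + 1:]
    return body.strip()

fenced = f"Here is the data:\n{FENCE}yaml\nname: John Doe\nage: 32\n{FENCE}"
print(strip_code_fence(fenced))
print(strip_code_fence("name: Jane Roe"))
```

Pass the cleaned string to yaml.safe_load and treat a yaml.YAMLError as a failed extraction rather than letting it stop a batch.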

List extraction

When you need to pull a variable number of items (entities, line items, experiences):

class Entity(BaseModel):
    name: str
    type: str = Field(description="Type: person, organization, location, or date")

class ParseEntities(dspy.Signature):
    """Extract all named entities from the text."""
    text: str = dspy.InputField()
    entities: list[Entity] = dspy.OutputField(desc="All entities found in the text")

parser = dspy.ChainOfThought(ParseEntities)
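
Extracted lists can contain the same entity more than once, for example when a name recurs with different casing. Below is a small post-processing sketch, assuming the entities have already been converted to dicts (e.g. via Entity.model_dump()). dedupe_entities is a hypothetical helper, and the sample records are made up.

```python
def dedupe_entities(entities):
    """Drop repeats of the same (name, type), ignoring case; keep first-seen order."""
    seen = set()
    unique = []
    for ent in entities:
        key = (ent["name"].strip().lower(), ent["type"].strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append(ent)
    return unique

raw = [
    {"name": "Acme Corp", "type": "organization"},
    {"name": "acme corp", "type": "organization"},
    {"name": "Springfield", "type": "location"},
]
print(dedupe_entities(raw))
```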

Step 3: Load your data

From files

from pathlib import Path

# Single file
text = Path("document.txt").read_text()
result = parser(text=text)

# Directory of files
documents = []
for path in Path("documents/").glob("*.txt"):
    documents.append({"file": path.name, "text": path.read_text()})

From a CSV

import pandas as pd

df = pd.read_csv("emails.csv")  # column: body
results = []
for _, row in df.iterrows():
    result = parser(text=row["body"])
    results.append(result.person.model_dump())  # Pydantic → dict

# Save extracted data
pd.DataFrame(results).to_csv("extracted.csv", index=False)

From transcripts (VTT, LiveKit, Recall)

Transcripts are a common parsing source — extracting caller info, action items, decisions, or structured summaries from conversations.

WebVTT (.vtt) files:

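import re
from pathlib import Path

def load_vtt(path):
    """Extract text from a VTT transcript, stripping the header, cue numbers, and timestamps."""
    lines = []
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blanks, the WEBVTT header, and numeric cue identifiers
        if not line or line.startswith("WEBVTT") or line.isdigit():
            continue
        # Skip timing lines such as "00:00:01.000 --> 00:00:04.000"
        if "-->" in line or re.match(r"\d{2}:\d{2}", line):
            continue
        lines.append(line)
    return " ".join(lines)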
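
If speaker attribution matters for the extraction (who asked for what), flattening everything into one string loses it. The sketch below keeps WebVTT voice tags (`<v Name>`) as speaker labels. vtt_to_turns is a hypothetical helper and assumes the exporter emits voice tags, which not every tool does; untagged lines get an empty speaker.

```python
import re

VOICE_TAG = re.compile(r"<v(?:\.[^\s>]+)*\s+([^>]+)>")  # e.g. <v Speaker Name>
ANY_TAG = re.compile(r"<[^>]+>")

def vtt_to_turns(vtt_text: str) -> list[tuple[str, str]]:
    """Parse WebVTT text into (speaker, utterance) pairs; speaker is "" when untagged."""
    turns = []
    for raw in vtt_text.splitlines():
        line = raw.strip()
        # Skip blanks, the header, numeric cue ids, and timing lines
        if not line or line.startswith("WEBVTT") or line.isdigit() or "-->" in line:
            continue
        match = VOICE_TAG.search(line)
        speaker = match.group(1).strip() if match else ""
        text = ANY_TAG.sub("", line).strip()
        if not text:
            continue
        # Merge consecutive lines from the same speaker into one turn
        if turns and turns[-1][0] == speaker:
            turns[-1] = (speaker, turns[-1][1] + " " + text)
        else:
            turns.append((speaker, text))
    return turns

sample = """WEBVTT

1
00:00:01.000 --> 00:00:04.000
<v Agent>Thanks for calling, how can I help?

2
00:00:04.500 --> 00:00:08.000
<v Caller>My invoice is wrong.
<v Caller>It lists the old address.
"""
for speaker, text in vtt_to_turns(sample):
    print(f"{speaker}: {text}")
```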

LiveKit transcripts:

import json

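def load_livekit_transcript(path):
    """Extract text from a LiveKit transcript JSON export."""
    # Use a context manager so the file handle is closed promptly
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # Segment lists may live under "segments" or "results" depending on the export
    segments = data.get("segments", data.get("results", []))
    return " ".join(seg.get("text", "") for seg in segments)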

Recall.ai transcripts:

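def load_recall_transcript(transcript_data):
    """Extract text from a Recall.ai transcript response."""
    parts = []
    for entry in transcript_data:
        words = entry.get("words") or []
        # Assumes "words" is a list of {"text": ...} objects; a plain string is kept as-is
        parts.append(words if isinstance(words, str) else " ".join(w.get("text", "") for w in words))
    return " ".join(p for p in parts if p)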

Example: extracting structured data from a call transcript:

class CallSummary(BaseModel):
    caller_name: Optional[str] = None
    issue_summary: str
    resolution: Optional[str] = None
    follow_up_needed: bool
    action_items: list[str]

class ParseCallTranscript(dspy.Signature):
    """Extract structured information from a customer call transcript."""
    transcript: str = dspy.InputField(desc="Full call transcript text")
    summary: CallSummary = dspy.OutputField()

parser = dspy.ChainOfThought(ParseCallTranscript)
transcript = load_livekit_transcript("call_001.json")
result = parser(transcript=transcript)

From Langfuse traces

Extract structured data from AI interactions logged in Langfuse:

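from langfuse import Langfuse

langfuse = Langfuse()
# fetch_traces comes from the Langfuse Python SDK v2; newer SDK versions may expose trace listing differently
traces = langfuse.fetch_traces(limit=100).data

# Parse each trace's input/output for structured fields
for trace in traces:
    if trace.input:
        # trace.input may be a dict, a string, or a list depending on how it was logged
        if isinstance(trace.input, dict):
            text = trace.input.get("message", str(trace.input))
        else:
            text = str(trace.input)
        result = parser(text=text)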

Step 4: Handle messy data

Real-world text is messy. Use a reward function with dspy.Refine to catch bad extractions and retry:

class ValidatedParser(dspy.Module):
    def __init__(self):
        super().__init__()
        self.parse = dspy.ChainOfThought(ParseContact)

    def forward(self, text):
        return self.parse(text=text)

def contact_reward(args, pred):
    """Score a prediction from 0.0 to 1.0; args holds the module's input kwargs."""
    score = 1.0
    if not pred.email or "@" not in pred.email:
        score -= 0.4  # Email should contain @
    # Count digits only, so punctuation such as "(555) 123-4567" does not inflate the length
    phone_digits = "".join(ch for ch in (pred.phone or "") if ch.isdigit())
    if len(phone_digits) < 10:
        score -= 0.3  # Phone number should have at least 10 digits
    return max(score, 0.0)

validated_parser = dspy.Refine(
    module=ValidatedParser(),
    N=3,
    reward_fn=contact_reward,
    threshold=0.7,
)

dspy.Refine runs the module up to N times and stops as soon as an attempt's reward reaches threshold; if none does, it returns the attempt with the highest reward. Penalize each failed constraint in proportion to its importance.
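
To make the control flow concrete, here is a plain-Python sketch of a retry-and-keep-best loop. It is a simplified stand-in, not DSPy's implementation: refine_sketch and fake_parser are hypothetical, and the real dspy.Refine also feeds feedback from earlier attempts into later ones, which this sketch omits.

```python
from types import SimpleNamespace

def contact_reward(args, pred):
    """Same scoring idea as above: dock points for a bad email or a short phone."""
    score = 1.0
    if not pred.email or "@" not in pred.email:
        score -= 0.4
    digits = "".join(ch for ch in (pred.phone or "") if ch.isdigit())
    if len(digits) < 10:
        score -= 0.3
    return max(score, 0.0)

def refine_sketch(module, n, reward_fn, threshold, **kwargs):
    """Call module up to n times; stop early at threshold, else return the best attempt."""
    best_pred, best_score = None, float("-inf")
    for _ in range(n):
        pred = module(**kwargs)
        score = reward_fn(kwargs, pred)
        if score > best_score:
            best_pred, best_score = pred, score
        if score >= threshold:
            break
    return best_pred, best_score

# Hypothetical stand-in for an LM-backed parser: each call returns a better attempt
attempts = iter([
    SimpleNamespace(email="jane at example.com", phone="555-0100"),
    SimpleNamespace(email="jane@example.com", phone="555-0100"),
    SimpleNamespace(email="jane@example.com", phone="(217) 555-0100"),
])

def fake_parser(text):
    return next(attempts)

pred, score = refine_sketch(fake_parser, n=3, reward_fn=contact_reward,
                            threshold=0.9, text="sample input")
print(pred.email, pred.phone, score)
```

A higher threshold means more retries on borderline outputs, so it trades extra LM calls for cleaner results.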

Hybrid extraction with regex backstop

The model does the heavy lifting, then regex sweeps for anything it missed. This pattern improved F1 from 0.733 (model-only) to 0.929 (hybrid) in a production privacy extraction system.

import re

def hybrid_extract(text, parser, patterns):
    """Run model extraction, then fill gaps with regex patterns."""
    result = parser(text=text)

    # Regex backstop: catch fields the model missed
    for field, pattern in patterns.items():
        model_value = getattr(result, field, None)
        if not model_value:
            match = re.search(pattern, text)
            if match:
                # setattr goes through the Prediction's own field storage instead of bypassing it via __dict__
                setattr(result, field, match.group(1) if match.groups() else match.group())

    return result

# Define regex patterns for common fields
patterns = {
    "email": r"[\w.+-]+@[\w-]+\.[\w.-]+",
    "phone": r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}",
    "zip_code": r"\b\d{5}(?:-\d{4})?\b",
}

result = hybrid_extract(messy_text, parser, patterns)

Use this when your input has a mix of structured patterns (emails, phones, dates) and free-form text. The model handles ambiguous fields like names and summaries; regex catches well-formatted fields the model occasionally skips.
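
As a quick check of what the backstop patterns catch on their own, the sketch below runs them over a sample message with no model involved. regex_sweep is a hypothetical helper and the sample text is invented for illustration.

```python
import re

patterns = {
    "email": r"[\w.+-]+@[\w-]+\.[\w.-]+",
    "phone": r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}",
    "zip_code": r"\b\d{5}(?:-\d{4})?\b",
}

def regex_sweep(text, patterns):
    """Return the first match for each pattern, or None when nothing matches."""
    found = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, text)
        found[field] = match.group() if match else None
    return found

sample = ("Reach Dana at dana.lee+billing@example.org or (217) 555-0142. "
          "Office: Springfield, IL 62701.")
print(regex_sweep(sample, patterns))
```

Patterns like the ZIP one also match any other five-digit number, which is a reason to apply the backstop only to fields the model left empty.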

Handling missing fields

When a field genuinely isn't in the text, you want the model to say so rather than hallucinate a value. Use Optional types in your Pydantic model, and add a validation note in the signature docstring:

class ParseContact(dspy.Signature):
    """Extract contact info from the text. Return None for fields not present — do not guess."""
    text: str = dspy.InputField()
    name: str = dspy.OutputField(desc="Person's full name")
    email: Optional[str] = dspy.OutputField(desc="Email address, or None if not found")
    phone: Optional[str] = dspy.OutputField(desc="Phone number, or None if not found")
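
Even with that instruction, a model may return a placeholder string such as "None" or "N/A" for a text field instead of a real None. Here is a minimal post-processing sketch that maps such placeholders to None. normalize_missing and the marker list are assumptions to tune for your data.

```python
from typing import Optional

# Assumed placeholder spellings; extend or trim for your own data
MISSING_MARKERS = {"", "none", "null", "n/a", "not found"}

def normalize_missing(value: Optional[str]) -> Optional[str]:
    """Map placeholder strings to None; return real values stripped of whitespace."""
    if value is None:
        return None
    cleaned = value.strip()
    return None if cleaned.lower() in MISSING_MARKERS else cleaned

raw = {"name": "Jane Roe", "email": "None", "phone": " N/A "}
clean = {field: normalize_missing(value) for field, value in raw.items()}
print(clean)
```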

Step 5: Evaluate quality

from dspy.evaluate import Evaluate

def parsing_metric(example, prediction, trace=None):
    """Score based on field-level accuracy (partial credit)."""
    correct = 0
    total = 0
    for field in ["name", "email", "phone"]:
        expected = getattr(example, field, None)
        predicted = getattr(prediction, field, None)
        if expected is not None:
            total += 1
            if predicted and expected.lower().strip() == predicted.lower().strip():
                correct += 1
    return correct / total if total > 0 else 0.0

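# devset: a list of dspy.Example objects labeled with name, email, and phone
evaluator = Evaluate(devset=devset, metric=parsing_metric, num_threads=4, display_progress=True)
result = evaluator(parser)
# Newer DSPy releases return a result object with a .score attribute; older ones return a float
score = getattr(result, "score", result)
print(f"Baseline accuracy: {score}%")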

For Pydantic outputs, compare field-by-field or use the model's .model_dump() to compare dicts. Partial credit (scoring each field independently) is better than all-or-nothing for extraction tasks — it tells you which specific fields are causing problems.
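
One way to see which fields cause problems is to tally accuracy per field across the dev set. The sketch below uses plain objects in place of dspy.Example and prediction objects. per_field_accuracy and the sample records are hypothetical.

```python
from types import SimpleNamespace

def per_field_accuracy(pairs, fields):
    """Return {field: fraction correct} over (expected, predicted) pairs.

    A field only counts toward the total when the expected value is not None.
    """
    correct = {f: 0 for f in fields}
    total = {f: 0 for f in fields}
    for expected, predicted in pairs:
        for f in fields:
            want = getattr(expected, f, None)
            got = getattr(predicted, f, None)
            if want is None:
                continue
            total[f] += 1
            if got and want.lower().strip() == got.lower().strip():
                correct[f] += 1
    return {f: (correct[f] / total[f] if total[f] else None) for f in fields}

pairs = [
    (SimpleNamespace(name="Jane Roe", email="jane@example.com", phone=None),
     SimpleNamespace(name="jane roe", email="jane@example.com", phone="555-0100")),
    (SimpleNamespace(name="Sam Poe", email="sam@example.com", phone="217-555-0142"),
     SimpleNamespace(name="Sam Poe", email=None, phone="2175550142")),
]
print(per_field_accuracy(pairs, ["name", "email", "phone"]))
```

In this sample the phone miss is a formatting difference rather than a wrong value, which suggests normalizing values before comparing them.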

Step 6: Optimize and deploy

# Optimize
optimizer = dspy.BootstrapFewShot(metric=parsing_metric, max_bootstrapped_demos=4)
optimized = optimizer.compile(parser, trainset=trainset)

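# Evaluate improvement
improved_result = evaluator(optimized)
# Newer DSPy releases return a result object with a .score attribute; older ones return a float
improved = getattr(improved_result, "score", improved_result)
print(f"Optimized accuracy: {improved}%")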

# Save for production
optimized.save("parser.json")

# Load later
parser = dspy.ChainOfThought(ParseContact)
parser.load("parser.json")

Batch processing

For parsing many documents at once:

import json

results = []
errors = []

for doc in documents:
    try:
        result = optimized(text=doc["text"])
        results.append({
            "source": doc["file"],
            # Assumes a signature with a Pydantic `person` output, such as ParsePerson
            **result.person.model_dump()  # flatten Pydantic fields
        })
    except Exception as e:
        errors.append({"source": doc["file"], "error": str(e)})

# Save results
with open("extracted.json", "w") as f:
    json.dump(results, f, indent=2)

if errors:
    print(f"{len(errors)} documents failed to parse — check errors list")

Additional resources

For worked examples (invoices, resumes, entities, relations, forms), see examples.md
Need summaries instead of structured data? Use /ai-summarizing
AI missing items on complex inputs? Use /ai-decomposing-tasks
Want to measure and improve further? Use /ai-improving-accuracy
Need to generate training data? Use /ai-generating-data

Gotchas

Pydantic models must be JSON-serializable — avoid custom types, datetime objects, or complex validators in output models. Stick to str, int, float, bool, list, dict, and nested Pydantic models.
Optional fields need an explicit "return None" instruction — use field: Optional[str] = dspy.OutputField(desc="... or None if not found") and state "return None for missing fields" in the signature docstring, or the model will hallucinate values for missing fields instead of returning None.
List extraction undercounts by default — when extracting lists of items (e.g., "all people mentioned"), the LM tends to stop early. Set max_tokens higher and add a "be exhaustive" instruction in the signature docstring.
Long inputs get truncated silently — if your input text exceeds the model's context window, DSPy doesn't warn you. Chunk long documents before parsing, or use a model with a larger context window.
Nested Pydantic models increase failure rate — each level of nesting adds extraction difficulty. Flatten where possible, or break into multiple extraction steps (extract outer structure first, then fill in nested fields).
Install /ai-do if you do not have it — it routes any AI problem to the right skill and is the fastest way to work: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-do
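
For the long-input gotcha above, a character-based chunker with overlap is one simple option. This is a sketch: chunk_text is a hypothetical helper, sizes are in characters rather than tokens, and the default values are placeholders to tune against your model's context window.

```python
def chunk_text(text: str, max_chars: int = 4000, overlap: int = 200) -> list[str]:
    """Split text into chunks of at most max_chars, each sharing `overlap` chars with the previous one."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be >= 0 and smaller than max_chars")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back so text near a boundary appears in both chunks
    return chunks

doc = "".join(str(i % 10) for i in range(25))
chunks = chunk_text(doc, max_chars=10, overlap=3)
print(chunks)
```

Run the parser on each chunk and merge the results afterwards. Overlap means the same field can be extracted twice, so deduplicate when merging.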