| name | Extract structured data from unstructured files (PDF, PPTX, DOCX...) |
| description | Invoke this skill BEFORE implementing any structured data extraction from documents to learn the correct llama_cloud_services API usage. Required reading before writing extraction code. Requires llama_cloud_services package and LLAMA_CLOUD_API_KEY as an environment variable. |
Structured Data Extraction
Quick start
- Define a schema for the data you would like to extract:
from pydantic import BaseModel, Field
class Resume(BaseModel):
    name: str = Field(description="Full name of candidate")
    email: str = Field(description="Email address")
    skills: list[str] = Field(description="Technical skills and technologies")
NOTE: Use basic types when possible. Avoid nested dictionaries. Lists are ok.
- Create a LlamaExtract instance:
from llama_cloud_services import LlamaExtract
extractor = LlamaExtract(
    show_progress=True,
    check_interval=5,
)
- Define the extraction configuration:
from llama_cloud import ExtractConfig, ExtractMode, ExtractTarget
extract_config = ExtractConfig(
    extraction_mode=ExtractMode.MULTIMODAL,
    extraction_target=ExtractTarget.PER_DOC,
    system_prompt="<Insert relevant context for extraction>",
    high_resolution_mode=True,
    invalidate_cache=False,
    cite_sources=True,
    use_reasoning=True,
    confidence_scores=True,
)
- Extract the data from the document:
result = extractor.extract(Resume, extract_config, "resume.pdf")
print(Resume.model_validate(result.data))
For more detailed code implementations, see REFERENCE.md.
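The quick-start note about preferring basic types over nested dictionaries can be illustrated with a small sketch. It assumes pydantic v2 (the quick start already uses `model_validate`). The `highest_degree` and `degree_institution` fields and the sample payload are hypothetical, written by hand for illustration; they are not real extraction output.

```python
from pydantic import BaseModel, Field


class Resume(BaseModel):
    """Flat schema: scalar fields and lists only, no nested dictionaries."""

    name: str = Field(description="Full name of candidate")
    email: str = Field(description="Email address")
    skills: list[str] = Field(description="Technical skills and technologies")
    # Hypothetical fields: rather than a nested dict such as
    # {"education": {"degree": ..., "school": ...}}, use one flat field per value.
    highest_degree: str = Field(description="Highest degree obtained")
    degree_institution: str = Field(
        description="Institution that awarded the highest degree"
    )


# Hand-written sample payload, standing in for the data an extraction might return.
sample = {
    "name": "Ada Lovelace",
    "email": "ada@example.com",
    "skills": ["Python", "SQL"],
    "highest_degree": "MSc Mathematics",
    "degree_institution": "Example University",
}

resume = Resume.model_validate(sample)
print(resume.skills)
```

Giving each value its own described field keeps the schema shallow, which is what the note recommends.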
Requirements
The llama_cloud_services package must be installed in your environment (it brings in the pydantic and llama_cloud packages as dependencies):
pip install llama_cloud_services
The LLAMA_CLOUD_API_KEY must also be available as an environment variable:
export LLAMA_CLOUD_API_KEY="..."
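Because the key is read from the environment, it can help to fail early with a clear message when it is missing. The helper below is a minimal sketch using only the standard library. `require_api_key` is a hypothetical name introduced here and is not part of llama_cloud_services.

```python
import os


def require_api_key(var_name: str = "LLAMA_CLOUD_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper for illustration; not part of llama_cloud_services.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; export it before running extraction."
        )
    return key
```

Calling `require_api_key()` once before creating the LlamaExtract instance surfaces a missing key at startup instead of during the first extraction request.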