This tutorial will guide you through using Writer with Instructor, a Python library that makes it easy to get structured data like JSON from LLMs.

Prerequisites

Before you begin, you'll need:

  • A Writer AI Studio account and API key
  • Python installed, along with Poetry for dependency management

Getting started

To get started with Instructor, you’ll need to install the library and set up your environment.

1. Obtain an API key

First, make sure that you’ve signed up for a Writer AI Studio account and obtained an API key. You can follow the Quickstart to create an app and obtain an API key.

2. Install instructor

Once you’ve done so, install instructor with Writer support by running:

pip install instructor[writer]

3. Set the `WRITER_API_KEY` environment variable

Set the WRITER_API_KEY environment variable to your Writer API key, or pass the key as an argument to the Writer constructor.
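For example, on macOS or Linux, you can export the variable in your shell before running your code:

export WRITER_API_KEY="your_api_key_here"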

Basic usage

The following is a simple example of how to use Instructor with Writer:

import instructor
from writerai import Writer
from pydantic import BaseModel

# Initialize Writer client
client = instructor.from_writer(Writer(api_key="your API key"))


class User(BaseModel):
    name: str
    age: int


# Extract structured data
user = client.chat.completions.create(
    model="palmyra-x-004",
    messages=[{"role": "user", "content": "Extract: John is 30 years old"}],
    response_model=User,
)

print(user)
#> name='John' age=30

This code defines a simple data model with two fields: name and age. It then uses the instructor.from_writer function to wrap the Writer client so that chat completions return validated instances of the data model instead of raw text.
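Because the returned user is a validated Pydantic instance rather than raw JSON, you can work with it like any other Python object. For example:

print(user.name)
#> John
print(user.model_dump_json())
#> {"name":"John","age":30}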

Building a data repair tool with Instructor and Writer

You can also use Instructor to do advanced data extraction and repair. In this example, you’ll build a Python application that extracts structured data from text, CSV, and PDF files using Instructor and Writer. This application will:

  • Parse text, CSV, and PDF files
  • Extract and validate structured data using Instructor and Writer
  • Output the results in CSV format

The finished code for this tutorial is available in the API tutorials GitHub repository.

Setting up the project

1. Create a new project

First, create a new project and set up Poetry for dependency management:

mkdir instructor-and-writer-tutorial
cd instructor-and-writer-tutorial
poetry init --no-interaction

2. Add dependencies

Add the required dependencies to your project:

poetry add instructor writer-sdk python-dotenv pydantic

3. Set up your environment variables

Create a .env file in your project root and add your Writer API key:

WRITER_API_KEY=your_api_key_here

4. Create `main.py` file and add imports

Create a main.py file and add the following imports:

import asyncio
import csv
import json
import os
from typing import Annotated, Type, Iterable, List

import instructor
from dotenv import load_dotenv
from pydantic import BaseModel, AfterValidator, Field
from writerai import Writer, AsyncWriter

load_dotenv()

Here’s what each import is used for:

  • asyncio: Used to process multiple files concurrently.
  • csv: Used to write the extracted data to a CSV file.
  • json: Used to convert the validated models into dictionaries when writing CSV rows.
  • os: Used to work with file paths and create output directories.
  • instructor: The Instructor library, used for structured output.
  • writerai: The Writer Python SDK, used to interact with the Writer API.
  • typing and pydantic: Used to define the types for fields in the UserExtract class defined in the Defining the data model section.
  • dotenv: Used to load the .env file that contains your Writer API key.

5. Set up the Writer client

Initialize the Writer client for both synchronous and asynchronous operations:

writer_client = Writer()
async_writer_client = AsyncWriter()

Defining the data model

For Instructor to extract structured output, you need to define a data model using Pydantic. Create a UserExtract class to represent the data you want to extract:

class UserExtract(BaseModel):
    @staticmethod
    def first_last_name_validator(v):
        # Require a capitalized, letters-only name. The error message is fed
        # back to the model on retries so it can repair the value.
        if not v or v[0] != v[0].upper() or v[1:] != v[1:].lower() or not v.isalpha():
            raise ValueError(
                "Name must contain only letters and be capitalized (first letter uppercase, rest lowercase)"
            )
        return v

    first_name: Annotated[str, AfterValidator(first_last_name_validator)] = Field(
        ..., description="The name of the user"
    )
    last_name: Annotated[str, AfterValidator(first_last_name_validator)] = Field(
        ..., description="The surname of the user"
    )
    email: str
This data model defines the fields that you want to extract from the files. The first_name and last_name fields are validated to ensure they start with an uppercase letter and contain only letters. In this example, the email field is a simple string, though you could also use Pydantic to validate the email format, as sketched below.
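As an illustrative sketch (not part of the tutorial code), here's how a variant of the model could validate emails using Pydantic's EmailStr type, which requires the optional email-validator package (installable with pip install "pydantic[email]"):

from pydantic import BaseModel, EmailStr


class UserExtractWithEmailCheck(BaseModel):
    first_name: str
    last_name: str
    # EmailStr rejects values that aren't syntactically valid email addresses
    email: EmailStr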

Parsing the files

With the data model defined, you can now implement file parsing. This involves creating functions to open the files and extract the text.

1. Create a function to handle file processing

Implement the main file handler function that orchestrates the entire process:

async def handle_file(file_path: str, response_model: Type[BaseModel], output_path: str = None) -> None:
    extension = os.path.splitext(file_path)[1]
    name = os.path.splitext(os.path.basename(file_path))[0]

    file_text = await fetch_file_text(file_path, name, extension)
    repaired_entities = await repair_data(file_text, response_model)

    print(f"Number of entities extracted from {name}{extension}: {len(repaired_entities)}")
    return generate_csv(repaired_entities, response_model, output_path)

This function orchestrates the processing of a single file: it reads the file, extracts its text, repairs the extracted data, and writes the results to a CSV file, using helper functions you'll define in the next steps.

2. Create a function to read the files

Next, create a function to read the files based on the given path and extension:

async def fetch_file_text(file_path: str, name: str, extension: str) -> str:
    allowed_extensions = [".txt", ".csv", ".pdf"]
    if extension not in allowed_extensions:
        raise ValueError(f"File extension {extension} is not allowed. Only {', '.join(allowed_extensions)}")

    print(f"Reading {name}{extension} content...")
    with open(file_path, 'rb') as file:
        file_contents = file.read()

    return await parse_file(file_contents, name, extension)

3. Extract the file content

Next, create a function to extract the text from the files. For text and CSV files, this function simply decodes the file contents. For PDFs, the function uploads the PDF using Writer's file upload endpoint, extracts the text using Writer's PDF parsing tool, and then deletes the file from Writer's servers using the file delete endpoint:

async def parse_file(file_bytes_content: bytes, name: str, extension: str) -> str:
    file_text = ""

    if extension == ".pdf":
        print(f"Uploading {name}{extension} content to writer servers...")
        file = await async_writer_client.files.upload(
            content=file_bytes_content,
            content_disposition=f"attachment; filename={name + extension}",
            content_type="application/octet-stream",
        )

        print(f"Converting {name}{extension} content from PDF to text...")
        file_text = await async_writer_client.tools.parse_pdf(
            file_id=file.id,
            format="text",
        )

        print(f"Deleting {name}{extension} from writer servers...")
        await async_writer_client.files.delete(file.id)
    else:
        print(f"Converting {name}{extension} content...")
        file_text = file_bytes_content.decode("utf-8")

    return file_text

Extracting and repairing the data

With the file content extracted, you can now implement data extraction and repair using Instructor and Writer.

1. Create a function to repair the data

Create a function to extract and repair data using Instructor:

async def repair_data(file_text: str, response_model: Type[BaseModel]) -> List[BaseModel]:
    instructor_client = instructor.from_writer(client=async_writer_client)

    if not issubclass(response_model, BaseModel):
        raise ValueError("Response model must be a subclass of pydantic BaseModel")

    print("Extracting data using Instructor...")
    # Instructor validates the model's response against the response model and,
    # because max_retries is set, re-prompts with the validation errors so the
    # model can repair invalid values.
    return await instructor_client.chat.completions.create(
        model="palmyra-x-004",
        response_model=Iterable[response_model],
        max_retries=5,
        messages=[
            {"role": "user", "content": f"Extract entities from {file_text}"},
        ],
    )

2. Implement CSV generation

Add a function to save the extracted data to CSV:

def generate_csv(entities: List[BaseModel], response_model: Type[BaseModel], output_path: str = None) -> None:
    fieldnames = list(response_model.model_json_schema()["properties"].keys())
    file_path = f"{response_model.__name__}.csv"

    if output_path:
        file_path = output_path + file_path
        os.makedirs(os.path.dirname(file_path), exist_ok=True)

    with open(file_path, "w", newline="") as file:  # newline="" avoids blank rows on Windows
        dict_writer = csv.DictWriter(file, fieldnames=fieldnames)
        dict_writer.writeheader()
        for entity in entities:
            dict_writer.writerow(json.loads(response_model(**entity.model_dump()).model_dump_json()))

Creating the main handler

Finally, implement the main function to process multiple files concurrently:

async def main():
    data = [
        ("example_data/ExampleFileTextFormat.txt", UserExtract, None),
        ("example_data/ExampleFilePDFFormat.pdf", UserExtract, "out/"),
    ]
    tasks = []

    for row in data:
        tasks.append(handle_file(row[0], row[1], row[2]))

    await asyncio.gather(*tasks)

if __name__ == "__main__":
    asyncio.run(main())

In this example, the input paths are hardcoded, but you could modify the application to accept input paths from the command line or a web interface, or read from a directory or database. A command-line variant is sketched below.
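For example, here's a minimal sketch of a command-line entry point using Python's built-in argparse module. The files argument and --output flag are illustrative names, not part of the finished tutorial code:

import argparse


async def main():
    parser = argparse.ArgumentParser(description="Extract and repair structured data from files")
    parser.add_argument("files", nargs="+", help="Paths to .txt, .csv, or .pdf input files")
    # generate_csv prepends this value directly, so include a trailing slash, e.g. out/
    parser.add_argument("--output", default=None, help="Optional output directory prefix")
    args = parser.parse_args()

    tasks = [handle_file(path, UserExtract, args.output) for path in args.files]
    await asyncio.gather(*tasks)

You could then run it with, for example, poetry run python main.py example_data/ExampleFileTextFormat.txt --output out/.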

Testing the application

Your data repair tool is now ready to use. To test it, follow these steps:

1. Create an `example_data` directory

Create an example_data directory and add some test files:

  • A text file with user information
  • A PDF file with user information

You can use the example data provided in the GitHub repository for this tutorial. If you provide your own, be sure to update the main.py file to point to the new files; a hypothetical example file is shown below.
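If you create your own text file, its contents might look something like this. The names and addresses are invented for illustration, with deliberately messy casing for the validators to catch and repair:

john SMITH, john.smith@example.com
jane doe, jane.doe@example.com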

2. Run the application

Run the application:

poetry run python main.py

The application will process both files concurrently and generate CSV files containing the extracted user information.
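Based on the UserExtract model, each output file is named UserExtract.csv and contains first_name, last_name, and email columns. With the hypothetical input above, the output would look something like this, though the exact values depend on the model's extraction:

first_name,last_name,email
John,Smith,john.smith@example.com
Jane,Doe,jane.doe@example.com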

Conclusion

You’ve now seen basic and advanced usage of Writer with Instructor. To learn more about Instructor, check out the Instructor documentation. Structured output is a powerful feature that can help you build more accurate and reliable applications, especially when combined with tool calling.