Structured output with Instructor
This tutorial will guide you through using Writer with Instructor, a Python library that makes it easy to get structured data like JSON from LLMs.
Prerequisites
- Python 3.8 or higher installed
- Poetry installed (see the Poetry installation guide)
- A Writer API key (follow the Quickstart to create an app and obtain an API key)
Getting started
To get started with Instructor, you’ll need to install the library and set up your environment.
Obtain an API key
First, make sure that you’ve signed up for a Writer AI Studio account and obtained an API key. You can follow the Quickstart to create an app and obtain an API key.
Install instructor
Once you’ve done so, install `instructor` with Writer support by running:
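One way to do this (assuming you use pip; `writer-sdk` is the PyPI package name for the Writer Python SDK):

```shell
pip install instructor writer-sdk
```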
Set the `WRITER_API_KEY` environment variable
Make sure to set the `WRITER_API_KEY` environment variable with your Writer API key, or pass it as an argument to the `Writer` constructor.
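For example, in a POSIX shell (replace the placeholder with your actual key):

```shell
export WRITER_API_KEY="your-api-key"
```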
Basic usage
The following is a simple example of how to use `instructor` with Writer:
This code creates a simple data model with two fields: `name` and `age`. It then uses the `instructor.from_writer` function to create a `client` object that uses the Writer API to extract structured data from text.
Building a data repair tool with Instructor and Writer
You can also use Instructor to do advanced data extraction and repair. In this example, you’ll build a Python application that extracts structured data from text, CSV, and PDF files using Instructor and Writer. This application will:
- Parse text, CSV, and PDF files
- Extract and validate structured data using Instructor and Writer
- Output the results in CSV format
The finished code for this tutorial is available in the API tutorials GitHub repository.
Setting up the project
Create a new project
First, create a new project and set up Poetry for dependency management:
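For example (the project name is an arbitrary choice):

```shell
poetry new data-repair-tool
cd data-repair-tool
```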
Add dependencies
Add the required dependencies to your project:
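Based on the imports used in this tutorial, the dependency list would look something like this (exact package versions are up to you):

```shell
poetry add instructor writer-sdk pydantic python-dotenv
```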
Set up your environment variables
Create a `.env` file in your project root and add your Writer API key:
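For example (replace the placeholder with your actual key):

```
WRITER_API_KEY=your-api-key-here
```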
Create `main.py` file and add imports
Create a `main.py` file and add the following imports:
Here’s what each import is used for:
- `asyncio`: Used to run the application on multiple files concurrently.
- `csv`: Used to write the extracted data to a CSV file.
- `json`: Used to write the extracted data to a JSON file.
- `os`: Used to read the files.
- `instructor`: The `instructor` library is used for structured output.
- `writerai`: The Writer Python SDK, which is used to interact with the Writer API.
- `typing` and `pydantic`: These modules are used to define the types for fields in the `UserExtract` class defined in the next step.
- `dotenv`: The `dotenv` module is used to load the `.env` file that contains your Writer API key.
Setting up Writer client
Initialize the Writer client for both synchronous and asynchronous operations:
Defining the data model
In order for Instructor to extract structured output, you need to define a data model using Pydantic. To define the data model, create a `UserExtract` class to represent the data you want to extract:
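One possible implementation, assuming Pydantic v2; the exact validation rules are a design choice:

```python
from pydantic import BaseModel, field_validator


class UserExtract(BaseModel):
    first_name: str
    last_name: str
    email: str  # kept as a plain string in this example

    @field_validator("first_name", "last_name")
    @classmethod
    def validate_name(cls, value: str) -> str:
        # Names must start with an uppercase letter and contain only letters
        if not value.isalpha() or not value[0].isupper():
            raise ValueError("must start with an uppercase letter and contain only letters")
        return value
```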
This data model defines the fields that you want to extract from the files. The `first_name` and `last_name` fields are validated to ensure they start with an uppercase letter and contain only letters. In this example, the `email` field is a simple string field, though you could also use a Pydantic field to validate the email format.
Parsing the files
With the data model defined, you can now implement file parsing. This involves creating functions to open the files and extract the text.
Create a function to handle file processing
Implement the main file handler function that orchestrates the entire process:
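The orchestration could be sketched like this; the helper names (`extract_file_content`, `repair_data`, `write_csv`) are assumptions matching the steps that follow:

```python
import os

SUPPORTED_EXTENSIONS = {".txt", ".csv", ".pdf"}


async def process_file(file_path: str) -> None:
    # Reject anything we don't know how to parse
    ext = os.path.splitext(file_path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {ext}")

    # Helpers defined in the following steps: extract the raw text,
    # repair it into UserExtract records, then write the CSV output.
    content = extract_file_content(file_path, ext)
    users = await repair_data(content)
    write_csv(f"{os.path.splitext(file_path)[0]}_repaired.csv", users)
```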
This function handles the file processing logic, including file type validation, text extraction, data repair, and CSV generation.
Create a function to read the files
Next, create a function to read the files based on the given path and extension:
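One way to sketch it; the key detail is the mode: PDFs are read as raw bytes so they can be uploaded later, while text and CSV files are read as UTF-8 text:

```python
def read_file(file_path: str, ext: str):
    """Read a file's contents, as bytes for PDFs and text otherwise."""
    if ext == ".pdf":
        with open(file_path, "rb") as f:
            return f.read()
    with open(file_path, encoding="utf-8") as f:
        return f.read()
```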
Extract the file content
Next, create a function to extract the text from the files. For text files, this function simply reads the file contents. For PDFs, the function uploads the PDF using Writer’s file upload endpoint, parses the text using Writer’s PDF parsing tool, and then deletes the file from Writer’s servers using the file delete endpoint:
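A sketch of this function. The Writer SDK call shapes below (`files.upload`, `tools.parse_pdf`, `files.delete`) are assumptions based on the endpoints described above; verify the exact signatures against the current SDK reference:

```python
import os


def extract_file_content(file_path: str, ext: str) -> str:
    """Return a file's text; PDFs are parsed via the Writer API."""
    if ext != ".pdf":
        with open(file_path, encoding="utf-8") as f:
            return f.read()

    # PDF branch: upload, parse, then delete the file from Writer's servers.
    # Imported lazily so the text branch works without the SDK configured.
    from writerai import Writer

    client = Writer()  # requires WRITER_API_KEY
    with open(file_path, "rb") as pdf:
        uploaded = client.files.upload(
            content=pdf.read(),
            content_disposition=f'attachment; filename="{os.path.basename(file_path)}"',
            content_type="application/pdf",
        )
    parsed = client.tools.parse_pdf(file_id=uploaded.id, format="text")
    client.files.delete(uploaded.id)
    return parsed.content
```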
Extracting and repairing the data
With the file content extracted, you can now implement data extraction and repair using Instructor and Writer.
Create a function to repair the data
Create a function to extract and repair data using Instructor:
Implementing CSV generation
Add a function to save the extracted data to CSV:
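One possible implementation, using the standard library's `csv` module; `UserExtract` is repeated (validators omitted) so the snippet stands alone:

```python
import csv
from typing import List

from pydantic import BaseModel


class UserExtract(BaseModel):  # as defined earlier, validators omitted
    first_name: str
    last_name: str
    email: str


def write_csv(output_path: str, users: List[UserExtract]) -> None:
    with open(output_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["first_name", "last_name", "email"])
        writer.writeheader()
        for user in users:
            # model_dump() turns the Pydantic model into a plain dict
            writer.writerow(user.model_dump())
```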
Creating the main handler
Finally, implement the main function to process multiple files concurrently:
In this example, the input paths are hardcoded, but you could modify the application to accept input paths from the command line or a web interface, or read from a directory or database.
Testing the application
Your data repair tool is now ready to use. To test it, follow these steps:
Create an `example_data` directory
Create an `example_data` directory and add some test files:
- A text file with user information
- A PDF file with user information
You can use the example data provided in the GitHub repository for this tutorial. If you provide your own, be sure to update the `main.py` file to point to the new files.
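If you want to generate a quick text file yourself, something like this works; the sample records below are made up, and deliberately messy so the repair step has something to fix:

```shell
mkdir -p example_data

# Hypothetical sample data: lowercase names and a malformed email
cat > example_data/users.txt <<'EOF'
john doe, john.doe@example,com
jane Smith, JANE.SMITH@example.com
EOF
```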
Run the application
Run the application:
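For example, with Poetry:

```shell
poetry run python main.py
```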
The application will process both files concurrently and generate CSV files containing the extracted user information.
Conclusion
You’ve now seen basic and advanced usage of Writer with Instructor. To learn more about Instructor, check out the Instructor documentation. Structured output is a powerful feature that can help you build more accurate and reliable applications, especially combined with tool calling.