Research papers

Below are research papers published by the Writer AI, NLP, and Data Science team.

Expect the Unexpected: FailSafe Long Context QA for Finance

February 2025

We propose a new long-context financial benchmark, FailSafeQA, designed to test the robustness and context-awareness of LLMs against six variations in human-interface interactions in LLM-based query-answer systems within finance.

We concentrate on two case studies: Query Failure and Context Failure. In the Query Failure scenario, we perturb the original query to vary in domain expertise, completeness, and linguistic accuracy. In the Context Failure case, we simulate the uploads of degraded, irrelevant, and empty documents. We employ the LLM-as-a-Judge methodology with Qwen2.5-72B-Instruct and use fine-grained rating criteria to define and calculate Robustness, Context Grounding, and Compliance scores for 24 off-the-shelf models.

The results suggest that although some models excel at mitigating input perturbations, they must balance robust answering with the ability to refrain from hallucinating. Notably, Palmyra-Fin-128k-Instruct, recognized as the most compliant model, maintained strong baseline performance but encountered challenges in sustaining robust predictions in 17% of test cases. On the other hand, the most robust model, OpenAI o3-mini, fabricated information in 41% of tested cases. The results demonstrate that even high-performing models have significant room for improvement and highlight the role of FailSafeQA as a tool for developing LLMs optimized for dependability in financial applications.

Writing in the Margins: Better inference pattern for long context retrieval

Link to paper

August 2024

In this paper, we introduce Writing in the Margins (WiM), a new inference pattern for Large Language Models designed to optimize the handling of long input sequences in retrieval-oriented tasks. This approach leverages the chunked prefill of the key-value cache to perform segment-wise inference, which enables efficient processing of extensive contexts along with the generation and classification of intermediate information (“margins”) that guide the model towards specific tasks.

This method increases computational overhead marginally while significantly enhancing the performance of off-the-shelf models without the need for fine-tuning. Specifically, we observe that WiM provides an average enhancement of 7.5% in accuracy for reasoning skills (HotpotQA, MultiHop-RAG) and more than a 30.0% increase in the F1-score for aggregation tasks (CWE).

Additionally, we show how the proposed pattern fits into an interactive retrieval design that provides end-users with ongoing updates about the progress of context processing, and pinpoints the integration of relevant information into the final response. We release our implementation of WiM using Hugging Face Transformers library at https://github.com/writer/writing-in-the-margins.

Comparative analysis of retrieval systems in the real world

Link to paper

May 2024

This research paper presents a comprehensive analysis of integrating advanced language models with search and retrieval systems in the fields of information retrieval and natural language processing. The objective is to evaluate and compare various state-of-the-art methods based on their performance in terms of accuracy and efficiency.

The analysis explores different combinations of technologies, including Azure Cognitive Search Retriever with GPT-4, Pinecone’s Canopy framework, Langchain with Pinecone and different language models (OpenAI, Cohere), LlamaIndex with Weaviate Vector Store’s hybrid search, Google’s RAG implementation on Cloud Vertex AI Search, Amazon SageMaker’s RAG, and a novel approach called KG-FID Retrieval.

The motivation for this analysis arises from the increasing demand for robust and responsive question-answering systems in various domains. The RobustQA metric is used to evaluate the performance of these systems under diverse paraphrasing of questions. The report aims to provide insights into the strengths and weaknesses of each method, facilitating informed decisions in the deployment and development of AI-driven search and retrieval systems.

OmniACT: a dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web

Link to paper

February 2024

For decades, human-computer interaction has fundamentally been manual. Even today, almost all productive work done on the computer necessitates human input at every step. Autonomous virtual agents represent an exciting step in automating many of these menial tasks. Virtual agents would empower users with limited technical proficiency to harness the full possibilities of computer systems. They could also enable the efficient streamlining of numerous computer tasks, ranging from calendar management to complex travel bookings, with minimal human intervention.

In this paper, we introduce OmniACT, the first-of-a-kind dataset and benchmark for assessing an agent’s capability to generate executable programs to accomplish computer tasks. Our scope extends beyond traditional web automation, covering a diverse range of desktop applications. The dataset consists of fundamental tasks such as “Play the next song”, as well as longer horizon tasks such as “Send an email to John Doe mentioning the time and place to meet”. Specifically, given a pair of screen image and a visually-grounded natural language task, the goal is to generate a script capable of fully executing the task.

We run several strong baseline language model agents on our benchmark. The strongest baseline, GPT-4, performs the best on our benchmark However, its performance level still reaches only 15% of the human proficiency in generating executable scripts capable of completing the task, demonstrating the challenge of our task for conventional web agents.Our benchmark provides a platform to measure and evaluate the progress of language model agents in automating computer tasks and motivates future work towards building multimodal models that bridge large language models and the visual grounding of computer screens.

Fusion-in-decoder: achieving state-of-the-art open-domain QA performance

Link to paper

September 2023

Open-domain question answering (QA) has recently made significant progress, with generative models like Transformers demonstrating impressive performance. However, these models are computationally expensive to train and query, limiting their practical application. In this whitepaper, we introduce a novel approach to open-domain QA that combines the strengths of retrieval and generative models, aiming to achieve more efficient and accurate question answering.

Our approach, termed Fusion-in-Decoder, retrieves informative passages and leverages them with a sequence-to-sequence model to generate answers. This method demonstrates state-of-the-art results on benchmarks like Natural Questions and TriviaQA, and offers a highly scalable framework for aggregating and combining information from multiple passages.

Becoming self-instruct: introducing early stopping criteria for minimal instruct tuning

Link to paper

July 2023

In this paper, we introduce the IFS, a metric for instruction following. The metric detects language models’ ability to follow instructions. First, IFS can distinguish between base and instruct models. We benchmark public bases and models, showing they’re well-formatted responses to partial and full sentences are effective. The metric can be used as a measure between model classes.

We compute IFS for Supervised early stopping. Follow instructions early and finetune later. As an example, we show model predictions are objective. We show that semantic changes can be caused by auxiliary metric ObjecQA. When IFS decomposes, it steepens. IFS and semantic factors start a controllable instruct trend. Tuning and querying opens minimal instruct interfaces Foundation models are short-lived.

Palmyra-Med: instruction-based fine-tuning of LLMs enhancing medical domain performance

Link to paper

July 2023

Palmyra-20b and Palmyra-40b are two cutting-edge large language models (LLMs) that were fine-tuned and evaluated for medical language understanding tasks. By applying instruction-based fine-tuning on a custom-curated medical dataset of 200,000 examples, we create novel, fine-tuned models, Palmyra-Med-20b and Palmyra-Med-40b. Performance is then measured across multiple medical knowledge datasets, including PubMedQA and MedQA.

Our fine-tuned models outperform both their base counterparts and other LLMs pre-trained on domain-specific knowledge. This research demonstrates the effectiveness of instruction-based fine-tuning in enhancing LLMs performance in the medical domain.

Grammatical error correction: a survey of the state of the art

Link to paper

April 2023

Grammatical Error Correction (GEC) is the task of automatically detecting and correcting errors in text. The task not only includes the correction of grammatical errors, such as missing prepositions and mismatched subject-verb agreement, but also orthographic and semantic errors, such as misspellings and word choice errors respectively. The field has seen significant progress in the last decade, motivated in part by a series of five shared tasks, which drove the development of rule-based methods, statistical classifiers, statistical machine translation, and finally neural machine translation systems which represent the current dominant state of the art.

In this survey paper, we condense the field into a single article and first outline some of the linguistic challenges of the task, introduce the most popular datasets that are available to researchers (for both English and other languages), and summarise the various methods and techniques that have been developed with a particular focus on artificial error generation. We hope that this survey will serve as comprehensive resource for researchers who are new to the field or who want to be kept apprised of recent developments.

Getting started

Guides

Best practices