Research papers
Below are research papers published by the Writer AI, NLP, and Data Science team.
In this paper, we introduce the Instruction Following Score (IFS), a metric that detects language models’ ability to follow instructions. First, IFS can distinguish between base and instruct models. We benchmark publicly available base and instruct models and show that the ratio of well-formatted responses to partial and full sentences can be an effective measure for separating the two model classes.
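Second, IFS can serve as an early-stopping criterion for instruct tuning. We compute IFS during supervised fine-tuning and show that models learn to follow instructions relatively early in training, and that further fine-tuning can change the semantics of the underlying base model. As an example of such a semantic change, we track the objectivity of model predictions, as defined by the auxiliary metric ObjecQA. In this case, semantic changes are steepest when IFS tends to plateau. We hope that decomposing instruct tuning into IFS and semantic factors starts a trend toward more controllable instruct tuning and opens possibilities for designing minimal instruct interfaces for querying foundation models.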
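One plausible reading of IFS is the share of a model’s responses that look like well-formatted answers rather than continuations of the prompt. The sketch below illustrates that ratio only. The `looks_like_answer` heuristic and the function names are assumptions made for illustration, not the classifier or code used in the paper.

```python
from typing import List


def looks_like_answer(response: str) -> bool:
    """Hypothetical heuristic (an assumption, not the paper's classifier):
    a well-formatted answer starts with a capital letter and ends with
    sentence-final punctuation."""
    text = response.strip()
    return bool(text) and text[0].isupper() and text[-1] in ".!?"


def instruction_following_score(responses: List[str]) -> float:
    """Share of responses judged to be well-formatted answers (0.0 to 1.0)."""
    if not responses:
        raise ValueError("need at least one response")
    answered = sum(looks_like_answer(r) for r in responses)
    return answered / len(responses)


responses = [
    "Paris is the capital of France.",                   # answer-like
    "and then the river flows north",                    # continuation-like
    "Water boils at 100 degrees Celsius at sea level.",  # answer-like
    "of the most common questions",                      # continuation-like
]
print(instruction_following_score(responses))
```

Tracked across training checkpoints, a score of this kind could in principle signal when additional fine-tuning no longer improves instruction following, which is the early-stopping use the abstract points to.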
Palmyra-20b and Palmyra-40b are two cutting-edge large language models (LLMs) that were fine-tuned and evaluated for medical language understanding tasks. By applying instruction-based fine-tuning on a custom-curated medical dataset of 200,000 examples, we create novel, fine-tuned models, Palmyra-Med-20b and Palmyra-Med-40b. Performance is then measured across multiple medical knowledge datasets, including PubMedQA and MedQA.
Our fine-tuned models outperform both their base counterparts and other LLMs pre-trained on domain-specific knowledge. This research demonstrates the effectiveness of instruction-based fine-tuning in enhancing LLM performance in the medical domain.
Grammatical Error Correction (GEC) is the task of automatically detecting and correcting errors in text. The task includes not only the correction of grammatical errors, such as missing prepositions and mismatched subject-verb agreement, but also orthographic and semantic errors, such as misspellings and word choice errors, respectively. The field has seen significant progress in the last decade, motivated in part by a series of five shared tasks, which drove the development of rule-based methods, statistical classifiers, statistical machine translation, and finally neural machine translation systems, which represent the current dominant state of the art.
In this survey paper, we condense the field into a single article and first outline some of the linguistic challenges of the task, introduce the most popular datasets that are available to researchers (for both English and other languages), and summarise the various methods and techniques that have been developed, with a particular focus on artificial error generation. We hope that this survey will serve as a comprehensive resource for researchers who are new to the field or who want to be kept apprised of recent developments.
Open-domain question answering (QA) has recently made significant progress, with generative models like Transformers demonstrating impressive performance. However, these models are computationally expensive to train and query, limiting their practical application. In this whitepaper, we introduce a novel approach to open-domain QA that combines the strengths of retrieval and generative models, aiming to achieve more efficient and accurate question answering.
Our approach, termed Fusion-in-Decoder, retrieves informative passages and leverages them with a sequence-to-sequence model to generate answers. This method demonstrates state-of-the-art results on benchmarks like Natural Questions and TriviaQA, and offers a highly scalable framework for aggregating and combining information from multiple passages.
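As the name Fusion-in-Decoder suggests, the retrieved passages are typically encoded separately and only combined when the decoder attends over them. The toy sketch below shows that data flow under loud assumptions: `toy_encode` is a hypothetical stand-in for a real sequence-to-sequence encoder, and the `question: ... context: ...` input layout is illustrative rather than taken from the whitepaper.

```python
from typing import List

Vector = List[float]


def toy_encode(text: str) -> List[Vector]:
    """Hypothetical stand-in for a seq2seq encoder: one 2-d vector per
    whitespace token (token length, position within this input)."""
    return [[float(len(tok)), float(i)] for i, tok in enumerate(text.split())]


def fuse_for_decoder(question: str, passages: List[str]) -> List[Vector]:
    """Encode each (question, passage) pair independently, then concatenate
    the encoded sequences so a decoder could attend over all of them."""
    fused: List[Vector] = []
    for passage in passages:
        fused.extend(toy_encode(f"question: {question} context: {passage}"))
    return fused


passages = ["Paris is the capital of France", "France is in Europe"]
fused = fuse_for_decoder("capital of France ?", passages)
print(len(fused))  # total number of encoded positions across both passages
```

Because each passage is encoded on its own, the encoding cost grows linearly with the number of passages, which is one reason this arrangement is described as scalable.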
For decades, human-computer interaction has fundamentally been manual. Even today, almost all productive work done on the computer necessitates human input at every step. Autonomous virtual agents represent an exciting step in automating many of these menial tasks. Virtual agents would empower users with limited technical proficiency to harness the full possibilities of computer systems. They could also enable the efficient streamlining of numerous computer tasks, ranging from calendar management to complex travel bookings, with minimal human intervention.
In this paper, we introduce OmniACT, a first-of-its-kind dataset and benchmark for assessing an agent’s capability to generate executable programs to accomplish computer tasks. Our scope extends beyond traditional web automation, covering a diverse range of desktop applications. The dataset consists of fundamental tasks such as “Play the next song”, as well as longer horizon tasks such as “Send an email to John Doe mentioning the time and place to meet”. Specifically, given a screen image paired with a visually grounded natural language task, the goal is to generate a script capable of fully executing the task.
We run several strong baseline language model agents on our benchmark. The strongest baseline, GPT-4, performs best. However, its performance still reaches only 15% of human proficiency in generating executable scripts capable of completing the task, demonstrating the challenge our task poses for conventional web agents. Our benchmark provides a platform to measure and evaluate the progress of language model agents in automating computer tasks and motivates future work towards building multimodal models that bridge large language models and the visual grounding of computer screens.
This research paper presents a comprehensive analysis of integrating advanced language models with search and retrieval systems in the fields of information retrieval and natural language processing. The objective is to evaluate and compare various state-of-the-art methods based on their performance in terms of accuracy and efficiency.
The analysis explores different combinations of technologies, including Azure Cognitive Search Retriever with GPT-4, Pinecone’s Canopy framework, LangChain with Pinecone and different language models (OpenAI, Cohere), LlamaIndex with Weaviate Vector Store’s hybrid search, Google’s RAG implementation on Cloud Vertex AI Search, Amazon SageMaker’s RAG, and a novel approach called KG-FID Retrieval.
The motivation for this analysis arises from the increasing demand for robust and responsive question-answering systems in various domains. The RobustQA metric is used to evaluate the performance of these systems under diverse paraphrasing of questions. The report aims to provide insights into the strengths and weaknesses of each method, facilitating informed decisions in the deployment and development of AI-driven search and retrieval systems.
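To make the idea of evaluating a system under paraphrased questions concrete, here is a minimal sketch that measures accuracy across several rewordings of the same question. It is not the RobustQA metric itself: the exact-match scoring, the data layout, and the names `paraphrase_accuracy` and `toy_system` are all assumptions made for illustration.

```python
from typing import Callable, Dict, List


def paraphrase_accuracy(
    answer_fn: Callable[[str], str],
    paraphrases: Dict[str, List[str]],
) -> float:
    """Fraction of paraphrased questions answered with the expected answer.

    `paraphrases` maps each expected answer to several rewordings of the
    same question (a hypothetical data layout)."""
    total = 0
    correct = 0
    for expected, questions in paraphrases.items():
        for question in questions:
            total += 1
            # Case-insensitive exact match, chosen only for simplicity.
            if answer_fn(question).strip().lower() == expected.lower():
                correct += 1
    if total == 0:
        raise ValueError("no questions supplied")
    return correct / total


def toy_system(question: str) -> str:
    """Stand-in QA system that only recognises the word 'capital'."""
    return "Paris" if "capital" in question.lower() else "unknown"


data = {
    "Paris": [
        "What is the capital of France?",
        "Which city is France's capital?",
        "Where does the French government sit?",
    ]
}
print(paraphrase_accuracy(toy_system, data))
```

A system that is brittle to rewording, like the keyword-based stand-in here, answers the first two questions but misses the third, so its score drops even though all three questions ask for the same fact.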