Training Data for Writer LLM

The Palmyra Large Language Model has demonstrated impressive performance in various natural language processing tasks. One of the key factors behind its success is the diverse and extensive training data it has been exposed to, which has given it expertise in business and professional writing.

Data Sources and Contributions

  • MC4 Filtered (MassiveWeb): The largest contributor, at 58% of Palmyra's training data, the MC4 filtered dataset is derived from a sampling of the MassiveWeb corpus. It focuses on business and professional writing and contains more than 1,331 billion tokens.
  • TrustedWeb: A highly reliable source of professional and business-related content.
  • RealNews: Accounting for 10% of the training data, the RealNews dataset provides more than 21 billion tokens of news articles. It helps the model understand the tone and style of journalism and exposes it to current events and topics.
  • C4: The C4 dataset contributes 10% of Palmyra's training data. Specific token counts are not available. Sourced from web pages, this dataset aids general language understanding and context.
  • Wikipedia-40B: With a 5% sampling ratio, the Wikipedia-40B dataset contributes more than 2 billion tokens to Palmyra's training data. Its encyclopedic content helps the model understand a broad range of topics and concepts.
  • GitHub: The GitHub dataset accounts for 3% of Palmyra's training data. Specific token counts are not available. This dataset exposes the model to programming languages, software development practices, and technical documentation.
  • Books: Comprising 27% of Palmyra's training data, the Books dataset offers more than 24 billion tokens. This source of literature exposes the model to a variety of writing styles and genres, further strengthening its language comprehension.
  • YouTube: Specific details about the YouTube dataset are not available. It is a source of transcribed video content that covers a wide range of topics and provides additional context for Palmyra's language understanding.
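
The sampling ratios listed above can be read as weights for drawing training examples from each source. The Python sketch below is illustrative only and is not Writer's actual data pipeline: the names STATED_RATIOS, normalize, and sample_sources are hypothetical, and sources without a stated ratio (TrustedWeb, YouTube) are left out rather than guessed. Because the stated percentages add up to 113 rather than 100, the sketch normalizes them before sampling.

```python
import random

# Sampling ratios (percent) as stated in the list above.
# TrustedWeb and YouTube are omitted because no ratio is given for them.
STATED_RATIOS = {
    "MC4 Filtered (MassiveWeb)": 58,
    "RealNews": 10,
    "C4": 10,
    "Wikipedia-40B": 5,
    "GitHub": 3,
    "Books": 27,
}


def normalize(ratios):
    """Scale the ratios so that they sum to 1.0."""
    total = sum(ratios.values())
    return {name: value / total for name, value in ratios.items()}


def sample_sources(ratios, n, seed=0):
    """Draw n source names with probability proportional to their ratio."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    probs = normalize(ratios)
    names = list(probs)
    weights = [probs[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


if __name__ == "__main__":
    for name, p in normalize(STATED_RATIOS).items():
        print(f"{name}: {p:.3f}")
    print(sample_sources(STATED_RATIOS, 5))
```

In a real training setup the draw would typically happen per batch or per document, and the weights would come from the pipeline's own configuration rather than from a hard-coded table like this one.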

The diverse range of datasets used to train the Palmyra Large Language Model underpins its proficiency in business and professional writing. With exposure to web content, news articles, encyclopedic knowledge, technical documentation, and literature, Palmyra is well-equipped to handle a wide range of natural language processing tasks.