Toxic check

We developed an independent system that classifies input and output textual data and predicts its level of toxicity. Our model was trained on a distinct dataset that included both toxic and nontoxic examples. Each prompt is sent through the API to one of our large language models, and the generated text is analyzed and classified by a model that predicts the text's toxicity level. This probability lies between 0 and 1, where 0 indicates toxic classes and 1 indicates nontoxic classes.

The text completion API methods return a JSON response with two fields, prompt labels and completion labels, that indicate the toxicity scores for each part independently. The field class name contains a binary class label determined by score, where a value less than 0.5 indicates toxicity and a value greater than 0.5 indicates nontoxicity.

"prompt_labels": [
      "class_name": "nontoxic",
      "score": 0.995605
"completion_labels": [
      "class_name": "nontoxic",
      "score": 0.997559    }

Next steps