Overview

The Model Comparison Lab is a Streamlit application that provides a user-friendly interface for comparing the performance and capabilities of OpenAI’s GPT-3.5 and GPT-4 models. It is designed for anyone interested in evaluating these models for various use cases, including research, development, or general curiosity.

Features

Model Comparison Tool: Input a prompt and receive side-by-side responses from GPT-3.5 and GPT-4. Understand the nuances and efficiency of each model in real time.

Response Evaluation: Rate each model’s response based on its effectiveness and relevance to your prompt. The app keeps track of your ratings.

The rating system provides a simple and intuitive way of evaluating model responses for your purposes. The 👍 button adds one to the running total, the 🤷‍♀️ button adds zero, and the 💩 button subtracts one. Each model’s cumulative rating is displayed for the session. This value will reset to zero if you refresh the page.
You can see your response ratings update in the session history table. If you make a mistake, just click on the button representing your true rating and the value will update in the session history table.
You can also add a comment and save it to a model’s response. You can see your comments update in the session history table.
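
The rating behavior described above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the names `RATING_VALUES`, `apply_rating`, and `cumulative_rating` are hypothetical, and it assumes that re-rating a response replaces its earlier rating and that a model's cumulative rating is the sum of its current ratings.

```python
# Hypothetical mapping from the three buttons to their values.
RATING_VALUES = {
    "thumbs_up": 1,   # 👍 adds one
    "shrug": 0,       # 🤷‍♀️ adds zero
    "poop": -1,       # 💩 subtracts one
}

def apply_rating(ratings, model, response_id, button):
    """Record a rating for one response; a later click overwrites the earlier one."""
    ratings[(model, response_id)] = RATING_VALUES[button]
    return ratings

def cumulative_rating(ratings, model):
    """Sum the current ratings for one model (assumed definition of the running total)."""
    return sum(value for (m, _), value in ratings.items() if m == model)

# In a Streamlit app this dict would typically live in st.session_state,
# which is why a page refresh resets the totals.
ratings = {}
apply_rating(ratings, "gpt-4", 0, "thumbs_up")
apply_rating(ratings, "gpt-4", 1, "poop")
apply_rating(ratings, "gpt-4", 1, "thumbs_up")  # correcting a mistaken rating
print(cumulative_rating(ratings, "gpt-4"))
```

Keying the ratings by response is one way to make a corrected click overwrite the mistake instead of being counted twice.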

Session History: View and export your interaction history.

You can export your session history to a CSV file for further analysis. By default, the app displays and exports only the chats from the current session, so refreshing the page clears your chat history.
- If you would like me to retain an extended chat history on this app for your API key, please let me know.
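
As a rough sketch of what a CSV export of the session history could look like, the example below uses Python's standard `csv` module. The column names (`model`, `prompt`, `response`, `rating`, `comment`) and the function `history_to_csv` are assumptions for illustration; the app's real columns may differ.

```python
import csv
import io

# Hypothetical column layout for one row of the session history table.
FIELDNAMES = ["model", "prompt", "response", "rating", "comment"]

def history_to_csv(history):
    """Serialize a list of history rows (dicts) to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    for row in history:
        # Missing fields (e.g. no comment yet) are written as empty cells.
        writer.writerow({name: row.get(name, "") for name in FIELDNAMES})
    return buffer.getvalue()

history = [
    {"model": "gpt-3.5", "prompt": "Hi", "response": "Hello!", "rating": 1},
    {"model": "gpt-4", "prompt": "Hi", "response": "Hello, there.", "rating": 0,
     "comment": "More formal"},
]
csv_text = history_to_csv(history)
print(csv_text)
```

In a Streamlit app, text like `csv_text` could be passed to `st.download_button` to offer the file for download.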

Cost Analysis: Each response includes a summary of the total tokens used by the system message, the prompt, and the generated response, as well as the corresponding API cost based on OpenAI’s pricing.

Unlike the Assistants API Lab, which uses the tiktoken package to estimate tokens, the Model Comparison Lab does not estimate token counts. Instead, OpenAI’s `completions` endpoint returns token counts for inputs and outputs.
Note that the system message in this app is "You are a helpful assistant.", which uses 5 tokens (according to OpenAI’s tokenizer tool).
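
To illustrate how a cost figure can be derived from the returned token counts, here is a minimal sketch. The `prompt_tokens` and `completion_tokens` keys follow the `usage` object that OpenAI returns with a response; the function name `response_cost` and the per-1,000-token prices are placeholders chosen for the example, not OpenAI's actual pricing.

```python
def response_cost(usage, input_price_per_1k, output_price_per_1k):
    """Estimate the API cost of one response from its token usage.

    usage: dict with "prompt_tokens" (system message + prompt) and
           "completion_tokens" (generated response).
    Prices are in dollars per 1,000 tokens and must be supplied by the caller.
    """
    input_cost = usage["prompt_tokens"] / 1000 * input_price_per_1k
    output_cost = usage["completion_tokens"] / 1000 * output_price_per_1k
    return input_cost + output_cost

# Placeholder usage and prices, for illustration only.
usage = {"prompt_tokens": 1000, "completion_tokens": 500}
cost = response_cost(usage, input_price_per_1k=0.01, output_price_per_1k=0.03)
print(f"${cost:.4f}")
```

Input and output tokens are priced separately here because OpenAI bills them at different rates, so the two counts cannot simply be added before multiplying.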