Metrics Configuration
Topics Categorization
Topics allow you to assign labels and logical groupings to user interactions and assistant responses within a session. This provides an intuitive way to understand and summarize user actions and intents without additional configuration. You can leverage topics to drill down into user sessions and create user cohorts around similar topics.
There are two approaches to leveraging topics on LogSpend:
- Automated topic detection - We leverage fine-tuned large language models to automatically extract relevant insights at the session or individual-interaction level. By default, topics are extracted at the session level, since analyzing individual user interactions in isolation loses the context of the broader session and adds noise. To benefit from this, it is important to provide a session identifier (session_id) as part of the Identity when logging LLM user interactions (see the sketch after this list).
- Custom Events - You can define specific keywords or actions to capture within a user session. Keywords can be matched either exactly or semantically using an LLM; this is configured when creating the custom event, including an optional custom prompt for semantic matching.
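As an illustration, a minimal sketch of logging an interaction with a session identifier is shown below. The client and argument names used here are assumptions for illustration only; refer to the integration guide for the actual LogSpend SDK interface.

```python
# Hypothetical sketch -- the client and field names below are assumptions,
# not the actual LogSpend SDK API; see the integration guide for the real one.
import uuid

from logspend import LogSpendClient  # assumed client name

client = LogSpendClient(api_key="YOUR_API_KEY")

# Reuse the same session_id for every interaction in the session so that
# session-level topic extraction can group them together.
session_id = str(uuid.uuid4())

client.log(
    input={"messages": [{"role": "user", "content": "Cancel my subscription"}]},
    output={"content": "I'm sorry to hear that. I can help you cancel it."},
    identity={
        "user_id": "user-123",
        "session_id": session_id,  # enables session-level topic detection
    },
)
```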
Toxicity
The toxicity metric gauges the offensive nature of the assistant's response by checking if it contains problematic content, such as hate speech. We leverage an LLM to measure this, producing values between 0 and 1; the higher the value, the more toxic the generated content. The default threshold used to classify a user session as containing toxic responses is 0.7. You can also leverage existing frameworks for measuring toxicity, such as the one offered by DeepEval.
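For example, if you compute toxicity yourself with DeepEval, a single interaction can be scored roughly as follows. This is a sketch based on DeepEval's documented ToxicityMetric; class names and arguments may differ across DeepEval versions.

```python
# Sketch: scoring one assistant response for toxicity with DeepEval.
from deepeval.metrics import ToxicityMetric
from deepeval.test_case import LLMTestCase

metric = ToxicityMetric(threshold=0.7)  # align with LogSpend's default threshold
test_case = LLMTestCase(
    input="How do I reset my password?",
    actual_output="Just click 'Forgot password' on the login page.",
)

metric.measure(test_case)
print(metric.score)            # value between 0 and 1; higher means more toxic
print(metric.is_successful())  # False if the score breaches the threshold
```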
Gender Bias
The gender bias metric indicates the degree of gender bias in the assistant's response by checking if it contains hurtful content or if there is a difference in treatment between groups (e.g., genders, sexual orientations, etc.). We leverage GenBit, a gender bias measurement tool, which produces a genbit_score greater than or equal to 0; the higher the score, the higher the likelihood of gender bias in the output. The default threshold used to classify a user session as containing gender bias is 0.7. See more details here on how to interpret the scores for different languages. You can also leverage existing frameworks for measuring bias, such as the one offered by DeepEval, which covers gender, racial, and political bias.
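As a rough sketch, a genbit_score can be computed over a batch of assistant responses with the GenBit package as shown below. The module path and parameter names are assumptions based on the published genbit package and may differ across versions.

```python
# Sketch: computing a genbit_score over assistant responses with GenBit.
# Module path, constructor parameters, and result keys are assumptions; check
# the GenBit documentation for the current interface.
from genbit.genbit_metrics import GenBitMetrics

responses = [
    "The nurse said she would call back later.",
    "The engineer presented his design to the team.",
]

genbit = GenBitMetrics(language_code="en", context_window=5)
genbit.add_data(responses, tokenized=False)

metrics = genbit.get_metrics(output_statistics=False, output_word_list=False)
print(metrics["genbit_score"])  # >= 0; higher means a higher likelihood of gender bias
```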
User Sentiment
We automatically evaluate the sentiment of the user's responses to give an indication of user frustration. The sentiment is then categorized as negative, neutral, or positive based on predefined thresholds. The defaults are: negative (less than 0), neutral (greater than 0 and less than or equal to 0.5), and positive (greater than 0.5).
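Expressed as code, the default categorization looks roughly like the sketch below. How a score of exactly 0 is handled is an assumption here, and the thresholds themselves are configurable.

```python
def categorize_sentiment(score: float) -> str:
    """Map a raw sentiment score to a category using the default thresholds."""
    if score > 0.5:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"  # scores from 0 up to 0.5 (boundary at 0 assumed neutral)

print(categorize_sentiment(0.8))   # positive
print(categorize_sentiment(-0.3))  # negative
```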
Language Detection and Matching
System, user, and assistant messages within a conversation may contain a mix of languages, which makes it hard to determine whether the assistant is responding in the language the user expects. To handle this, we deduce the set of languages expected within the conversation from the system and user messages, and then compare it against the languages detected in the assistant's responses. If there is enough overlap, we consider that the assistant responded in the right language. The default threshold is at least a 50% overlap between the expected languages and the languages the assistant responded in.
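A minimal sketch of this check is shown below. It assumes overlap is measured as the fraction of expected languages that appear in the assistant's responses, which is one possible interpretation of the 50% default.

```python
def language_match(expected: set[str], detected: set[str], threshold: float = 0.5) -> bool:
    """Return True if enough of the expected languages appear in the assistant's responses."""
    if not expected:
        return True  # nothing to match against
    overlap = len(expected & detected) / len(expected)
    return overlap >= threshold

# e.g. system/user messages in French and English, assistant replied only in English:
print(language_match({"fr", "en"}, {"en"}))  # 0.5 overlap -> True with the 50% default
```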
Sending Custom Metrics via SDK
You can run your own models or use LLMs of your choice to extract or compute specific metrics from user interactions, and send those scores as custom_properties via the LogSpend SDK (see the steps in the integration guide). This gives you full flexibility: you can leverage these metrics (custom_properties) to drill down into specific user segments and build reports and graphs.
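A minimal sketch of sending a custom metric is shown below. The client and method names are assumptions for illustration; custom_properties is the only name taken from this guide, and the actual call signature is documented in the integration guide.

```python
# Hypothetical sketch -- client and method names are assumptions; the real
# call signature is documented in the LogSpend integration guide.
from logspend import LogSpendClient  # assumed client name

client = LogSpendClient(api_key="YOUR_API_KEY")

# Compute your own metric however you like (own model, heuristic, or an LLM call).
resolution_score = 0.82  # e.g. likelihood that the user's issue was resolved

client.log(
    input={"messages": [{"role": "user", "content": "My invoice is wrong"}]},
    output={"content": "I've corrected the invoice and emailed you a copy."},
    identity={"user_id": "user-123", "session_id": "session-456"},
    custom_properties={
        "resolution_score": resolution_score,  # available later for segmentation and reports
        "handoff_to_human": False,
    },
)
```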