Experiments is a powerful tool you can use to test modifications on a subset of users. It allows you to easily try different system prompts, models, and RAG sources on a selected sample of users and visualize how these modifications impact your metrics.

Quick start

Creating an experiment is pretty straightforward. Remember that to achieve statistical significance, the experiment should be deployed to at least a few dozen users. Let’s see how you can leverage Nebuly’s platform to run A/B testing.

Step 1. Set up the experiment

You can create an experiment from the Experiment & A/B Testing tab on the platform by clicking the Create new experiment button. Once you have clicked the button, you can set up the main structure of the experiment:

  • Choose a flag-id. This is a unique ID that will be used in your code to identify which users to apply the experiment to.
  • Specify the hypothesis at the foundation of your experiment. This will be helpful when you have to decide whether or not to accept the proposed modification.
  • Select the primary metrics. These are the metrics that will guide your judgement on the changes you want to test. Available primary metrics are: number of warnings, negative user actions, negative user comments, positive user actions, positive user comments, cost, latency, daily interactions, and segment length.
  • Advanced - Select the confidence level. The confidence level expresses how confident you need to be that the result you are seeing is not due to a random fluctuation in the data distribution. It is usually set to 95%. Values below 80% should be avoided.
  • Advanced - Enable sequential testing. This method applies a statistical correction during an experiment to account for the fact that the experiment is ongoing.
  • Experiment End Date. The date the experiment is scheduled to conclude. The experiment can be resumed after this date if desired.
  • Allocation. Allocation is the most important part of the experiment definition. Here, you select three crucial quantities: the target users (the portion of users you might deploy the tested features to if the experiment succeeds), the treatment group (the users who will experience the new feature and have their metrics measured), and the control group (users who won’t see the feature, providing a baseline for comparison). We suggest that the combined treatment and control groups comprise approximately 10-20% of your target users.
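
As a quick illustration of how these allocation percentages translate into group sizes, here is a minimal sketch; the user counts and percentages below are made-up values, not platform defaults.

```python
# Illustrative only: how allocation percentages translate into group sizes.
# The figures below are hypothetical, not defaults of the platform.

total_users = 100_000    # monthly active users of your assistant
target_share = 0.50      # portion of users eligible for the feature
treatment_share = 0.10   # share of target users in the treatment group
control_share = 0.10     # share of target users in the control group

target_users = int(total_users * target_share)
treatment_users = int(target_users * treatment_share)
control_users = int(target_users * control_share)

print(f"Target users:    {target_users}")     # 50000
print(f"Treatment group: {treatment_users}")  # 5000 (sees the new feature)
print(f"Control group:   {control_users}")    # 5000 (baseline, no feature)
```

With these numbers, the combined treatment and control groups make up 20% of the target users, in line with the suggested 10-20%.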

Step 2. Choose the variant

Here you can select the kind of experiment you want to run. We propose three alternatives:

  • Add an extra system prompt: Craft a tailored system prompt for targeted users and append it to your model’s current prompt.
  • Route the request to a different model: Analyze which model best suits your use case, considering the trade-off between cost and performance.
  • Add a custom RAG source: Enable your model to call a RAG (Retrieval-Augmented Generation) source for improved responses to user requests.

Each personalization offers additional customization options, such as adjusting the model temperature when adding a new system prompt.
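
As a rough sketch of the first variant type, the example below appends an experiment-specific extra system prompt to a base system prompt and picks up an optional per-variant temperature; the prompt text, field names, and dictionary shape are illustrative assumptions, not the platform's actual data model.

```python
# Hypothetical sketch of the "extra system prompt" variant: field names,
# prompt text, and temperature are illustrative only.

BASE_SYSTEM_PROMPT = "You are a helpful assistant for our internal knowledge base."

# Personalization parameters you might receive for a user in the treatment
# group (see Step 3 for how variants are fetched).
variant = {
    "extra_system_prompt": "Answer concisely and always cite the source document.",
    "temperature": 0.2,  # optional per-variant customization
}

def build_messages(user_message, variant=None):
    """Append the experiment's extra system prompt to the base system prompt."""
    system_prompt = BASE_SYSTEM_PROMPT
    if variant and variant.get("extra_system_prompt"):
        system_prompt = f"{system_prompt}\n\n{variant['extra_system_prompt']}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ]

messages = build_messages("How do I reset my password?", variant)
temperature = variant.get("temperature", 0.7) if variant else 0.7
# `messages` and `temperature` are then passed to whichever LLM client you use.
```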

Step 3. Fetch variants

To fetch the variants and the related personalization parameters, you can use our SDK and exposed endpoints. Further information about them can be found in the SDK docs.
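
For orientation only, the sketch below shows what fetching a variant over HTTP could look like; the endpoint URL, parameter names, and response shape are invented for this example, so refer to the SDK docs for the actual interface.

```python
# Illustrative only: the endpoint URL, parameter names, and response shape
# below are assumptions made for this example. Refer to the SDK docs for the
# actual interface exposed by the platform.
import requests

API_KEY = "<your-api-key>"           # hypothetical credential
FLAG_ID = "extra-prompt-experiment"  # the flag-id chosen in Step 1
USER_ID = "user-1234"                # the end user making the request

response = requests.get(
    "https://api.example.com/experiments/variant",  # placeholder URL
    headers={"Authorization": f"Bearer {API_KEY}"},
    params={"flag_id": FLAG_ID, "user_id": USER_ID},
    timeout=5,
)
response.raise_for_status()
variant = response.json()

if variant.get("group") == "treatment":
    # Apply the personalization parameters here
    # (extra system prompt, routed model, RAG source, ...).
    print("Treatment parameters:", variant.get("parameters"))
else:
    # Control group or user not enrolled: serve the default experience.
    print("Default experience")
```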

Step 4. Track exposure

After setting up the SDK, it’s time to visualize your experiment results. However, before analyzing the metrics (and ensuring their statistical relevance), we perform three essential health checks on the collected data:

  • SDK Verification: We confirm the SDK is configured correctly and transmitting experiment data as expected.
  • Metric Validation: For each metric, we verify that data is present.
  • Balance Assessment: We conduct a thorough statistical analysis of data distribution and p-values. This allows us to identify potential data imbalances that could affect the accuracy of your results. Specific analyses performed on the p-value include:
    1. if 0.001 < p_value < 0.01: The p-value hasn’t reached a level that allows us to assert with high confidence that an actual imbalance exists. It’s advisable to pause and reassess the situation after another day.
    2. if p_value < 0.01 and there is a group (either treatment or control) whose share of active users deviates from the expected group size by less than 0.1%: An imbalance may be present; however, the expected impact on the experiment should be minimal. This is typically seen in large-scale experiments with user counts exceeding one million, where small fluctuations in group performance might result in an uneven but small number of data exclusions among the groups.
    3. if p_value < 0.001 and there is a group (either treatment or control) whose share of active users deviates from the expected group size by more than 0.1%: It is probable that there are issues with the exposure process in the experiment, leading to potential unreliability in the results.
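
For intuition, the balance assessment can be thought of as a sample-ratio check: compare the number of active users observed in each group with the counts implied by the allocation and compute a p-value. The sketch below uses a chi-square goodness-of-fit test, which is one common way to do this; the counts are made up and this is not necessarily the platform's exact procedure.

```python
# Illustrative sketch of a sample-ratio (balance) check using a chi-square
# goodness-of-fit test; counts are made up.
from scipy.stats import chisquare

observed = [5063, 4937]          # active users actually seen in treatment / control
expected_share = [0.5, 0.5]      # allocation configured in Step 1
total = sum(observed)
expected = [total * share for share in expected_share]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)

# Largest relative deviation (in %) of any group from its expected size.
deviation_pct = max(
    abs(obs - exp) / exp * 100 for obs, exp in zip(observed, expected)
)

if 0.001 < p_value < 0.01:
    print("Possible imbalance: wait another day and re-check.")
elif p_value < 0.01 and deviation_pct < 0.1:
    print("Imbalance likely, but the expected impact on the experiment is minimal.")
elif p_value < 0.001 and deviation_pct >= 0.1:
    print("Probable issue with the exposure process: results may be unreliable.")
else:
    print(f"No imbalance detected (p-value = {p_value:.3f}).")
```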

Once all health checks pass, you can begin analyzing your data. For each metric, we display a confidence interval based on the confidence level you selected in the previous steps. This confidence interval helps you determine the likelihood that any observed increase or decrease in your metrics is merely due to random variation.

If the confidence interval includes zero, it indicates uncertainty about whether the change in the metric is directly caused by the experimented feature. However, the further the confidence interval extends to the right (positive change) or left (negative change), the more confident you can be that the new feature is influencing the results.
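
To make this concrete, the sketch below builds a two-sided confidence interval for the difference in a binary metric (such as the rate of positive user actions) between treatment and control using a normal approximation; the numbers are invented, and this is only one standard way to construct such an interval, not necessarily the one used by the platform.

```python
# Illustrative: a 95% confidence interval for the difference in a binary
# metric (e.g. rate of positive user actions) between treatment and control,
# using a normal approximation. Numbers are made up.
from statistics import NormalDist

treatment_users, treatment_successes = 5000, 640   # 12.8% positive-action rate
control_users, control_successes = 5000, 570       # 11.4% positive-action rate
confidence_level = 0.95                            # chosen in Step 1

p_t = treatment_successes / treatment_users
p_c = control_successes / control_users
diff = p_t - p_c

# Standard error of the difference between two independent proportions.
se = (p_t * (1 - p_t) / treatment_users + p_c * (1 - p_c) / control_users) ** 0.5
z = NormalDist().inv_cdf(0.5 + confidence_level / 2)

lower, upper = diff - z * se, diff + z * se
print(f"Difference: {diff:+.3%}, {confidence_level:.0%} CI: [{lower:+.3%}, {upper:+.3%}]")

# If the interval [lower, upper] contains zero, the observed change could be
# due to random variation; the further it sits entirely above (or below) zero,
# the stronger the evidence that the variant moved the metric.
```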