Creating test suites - bench documentation

bench.readthedocs.io

Related Highlights

Take a look at our official page for user documentation and examples: langtest.org

Key Features

Generate and execute more than 50 distinct types of tests with just one line of code

Test all aspects of model quality: robustness, bias, representation, fairness and accuracy.

Automatically augment training data based on test results (for select models)

GitHub - BrunoScaglione/langtest: Deliver safe & effective language models

Nicolay Gerold added

The goal of a benchmark usability test is to describe how usable an application is relative to a set of benchmark goals.

Jeff Sauro • Quantifying the User Experience: Practical Statistics for User Research

You want to ensure that you run your tests for at least a hundred conversions on each variation, but the exact number might be unique to your own website.

Alex Harris • Small Business Big Money Online: A Proven System to Optimize eCommerce Websites and Increase Internet Profits

Test systems at production scale:

Amazon Web Services • AWS Well-Architected Framework (AWS Whitepaper)

Fluent Builders in Automated Tests

Blake Norrish medium.com

Objective #3 (20%): Execute at least 15 unique tests during the fiscal year. Document what was learned in each test, and share results with the executive team.

Kevin Hillstrom • Hillstrom's Email Marketing Excellence

The size of your test will be constrained by the traffic to your landing page and its data rate (the number of conversion actions per unit time). Changing the granularity of your tests allows you to include all or most of your important ideas while still fitting into a reasonable test size.

Maura Ginty • Landing Page Optimization: The Definitive Guide to Testing and Tuning for Conversions
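The relationship between data rate, granularity, and test size in the passage above can be made concrete with a little arithmetic. The following is a minimal sketch, not a method from the book: the function name `estimated_test_days`, the even traffic split, and the default threshold of 100 conversions per variation are all assumptions chosen for illustration.

```python
import math


def estimated_test_days(num_variations: int,
                        conversions_per_day: float,
                        min_conversions_per_variation: int = 100) -> int:
    """Estimate how many days a test needs to run.

    Assumes (hypothetically) that conversions are spread evenly across
    variations and that the data rate stays constant over the test.
    """
    if num_variations < 1:
        raise ValueError("num_variations must be at least 1")
    if conversions_per_day <= 0:
        raise ValueError("conversions_per_day must be positive")
    # Total conversions required across all variations.
    total_needed = num_variations * min_conversions_per_variation
    # Round up: a partial day still has to be run in full.
    return math.ceil(total_needed / conversions_per_day)


if __name__ == "__main__":
    # Finer granularity means more variations, which lengthens the test
    # at a fixed data rate.
    for variations in (2, 8, 32):
        days = estimated_test_days(variations, conversions_per_day=40)
        print(f"{variations} variations: about {days} days")
```

Under these assumptions, coarsening the test (fewer variations) is the lever that keeps the duration reasonable when the data rate is fixed.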

A new v0.4.0 release of lm-evaluation-harness is available!

New updates and features include:

Internal refactoring

Config-based task creation and configuration

Easier import and sharing of externally-defined task config YAMLs

Support for Jinja2 prompt design, easy modification of prompts + prompt imports from Promptsource

More advanced configuration opt

GitHub - sqrkl/lm-evaluation-harness: A framework for few-shot evaluation of language models.

Nicolay Gerold added