Executive Summary
This paper presents the most comprehensive simulation analysis of forced ranking systems in the literature. A 54,000-run Monte Carlo sensitivity analysis spanning 4 performance distributions, 5 team sizes, 6 noise levels, 3 tier counts, and 5 BTL comparison levels produces a single headline result: the overall error rate of forced ranking -- the share of employees assigned to the wrong performance tier -- is 0.494. Forced ranking is, on average, no better than a coin flip.
The cost-of-error analysis translates statistical misclassification into organizational cost. For a $10M bonus pool, the total organizational cost of forced ranking is $11.76M -- comprising $4.58M in direct misallocation (bonuses going to the wrong people) and $7.17M in attrition costs (high performers exiting because they correctly perceive the system as unfair). The system costs more than the bonuses it distributes.
The paper absorbs the previously separate graph-theoretic alternative (Paper 19) as a constructive section, presenting the Bradley-Terry-Luce (BTL) Mirror Supplement as a practical replacement. At k = 5 cross-team comparisons per employee, the BTL supplement achieves a 22% cost reduction while requiring only sparse pairwise data.
Monte Carlo Design
The simulation is designed as a full factorial sensitivity analysis to eliminate the concern that results depend on specific parameter choices:
- 4 performance distributions: Normal, log-normal, bimodal, and power-law -- representing different theories of how performance is actually distributed in organizations
- 5 team sizes: 5, 7, 10, 15, and 20 members -- spanning the range of common team structures
- 6 noise levels: From 0.0 (perfect manager assessment) to 0.5 (moderate assessment error) in increments of 0.1
- 3 tier counts: 3, 4, and 5 tiers -- representing common forced ranking configurations (bottom 10%, top 20%, etc.)
- 5 BTL comparison levels: 0, 3, 5, 7, and 10 cross-team comparisons per employee
Each parameter combination is run 100 times with different random seeds, producing 54,000 total simulation runs. This design means that the 0.494 error rate is not tied to any single configuration but is an average across the entire parameter space -- it is the expected error rate of forced ranking in general, not under specific conditions. A sketch of a single run appears below.
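To make the run-level mechanics concrete, here is a minimal sketch of one simulation run in Python. It assumes equal-width rank tiers and randomly composed teams; the function name, parameter defaults, and tiering rule are illustrative, not the paper's actual code.

```python
# Minimal sketch of one simulation run (illustrative, not the paper's
# code): draw true performance, add manager noise, rank within teams,
# and compare each employee's forced tier to their global tier.
import numpy as np

def run_once(n_teams=20, team_size=10, noise=0.3, n_tiers=3, seed=0):
    rng = np.random.default_rng(seed)
    n = n_teams * team_size
    true_perf = rng.normal(size=n)                      # "normal" condition
    observed = true_perf + noise * rng.normal(size=n)   # noisy manager view

    # Benchmark: the tier each employee earns against the whole org,
    # using equal-width rank bands (0 = bottom, n_tiers - 1 = top).
    cuts = np.quantile(true_perf, np.linspace(0, 1, n_tiers + 1)[1:-1])
    global_tier = np.searchsorted(cuts, true_perf)

    # Forced ranking: the same bands applied within each random team.
    teams = rng.permutation(n).reshape(n_teams, team_size)
    assigned_tier = np.empty(n, dtype=int)
    for team in teams:
        ranks = observed[team].argsort().argsort()      # 0 = worst
        assigned_tier[team] = ranks * n_tiers // team_size

    # Error rate: share of employees placed in the wrong tier.
    return float((assigned_tier != global_tier).mean())

print(run_once())
```

Sweeping `run_once` over the factorial grid (for example with `itertools.product`) and averaging yields the kind of parameter-space expectation reported above.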
The Error Rate Structure
The 0.494 overall error rate masks important structure. Error rates vary systematically with parameters:
- Team size effect: Smaller teams produce higher error rates (0.56 at team size 5, 0.44 at team size 20) because smaller samples are less representative of the global distribution (a sketch isolating this effect appears at the end of this section)
- Distribution effect: Power-law distributions produce the highest error rates (0.54) because the heavy tail means that a small number of extreme performers dominate the ranking, making the system extremely sensitive to which team those performers are assigned to
- Noise effect: Error rates increase approximately linearly with assessment noise, from 0.38 at zero noise to 0.58 at noise = 0.5
- Tier count effect: More tiers produce higher error rates because finer distinctions are more susceptible to noise
The critical finding is that there is no parameter combination under which forced ranking achieves error rates below 0.30. Even under the most favorable conditions (large teams, normal distribution, zero noise, few tiers), the system misclassifies nearly one in three employees.
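To see the team-size mechanism in isolation, the same assumed mechanics can be swept over team size with everything else held fixed. The following self-contained sketch uses a normal distribution, zero assessment noise, and 3 tiers; all names and defaults are illustrative.

```python
# Condensed sketch isolating the team-size effect: zero noise, normal
# performance, 3 tiers; only team size varies. Smaller teams are less
# representative samples of the org, so within-team cutoffs drift
# further from the global cutoffs. Illustrative, not the paper's code.
import numpy as np

def error_rate(team_size, n_target=3000, n_tiers=3, seed=0):
    rng = np.random.default_rng(seed)
    n = n_target - n_target % team_size          # divisible headcount
    perf = rng.normal(size=n)                    # no assessment noise
    cuts = np.quantile(perf, np.linspace(0, 1, n_tiers + 1)[1:-1])
    global_tier = np.searchsorted(cuts, perf)
    teams = rng.permutation(n).reshape(-1, team_size)
    assigned = np.empty(n, dtype=int)
    for team in teams:
        assigned[team] = perf[team].argsort().argsort() * n_tiers // team_size
    return float((assigned != global_tier).mean())

for size in (5, 7, 10, 15, 20):
    print(f"team size {size:2d}: error rate {error_rate(size):.3f}")
```

Even with perfect assessment, team composition alone keeps the error rate well above zero, which is the pattern the team-size and zero-noise findings above report.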
Cost-of-Error Analysis
The cost model translates misclassification rates into dollar values using two cost components and their sum (a stylized calculation follows the list):
- Direct misallocation cost ($4.58M): The difference between the bonus pool distribution under forced ranking and the distribution that would occur under perfect information. This represents bonuses paid to lower performers who were ranked highly due to favorable team composition, and bonuses denied to higher performers who were ranked low due to unfavorable team composition.
- Attrition cost ($7.17M): High performers who are systematically underranked by the system exit at elevated rates. At industry-standard replacement costs (1.5-2x annual salary for senior technical roles), the attrition driven by misclassification costs significantly more than the misallocation itself.
- Total organizational cost ($11.76M): The system intended to distribute $10M in performance-based compensation instead costs the organization $11.76M in combined misallocation and attrition. The forced ranking system is a net negative investment.
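As a stylized illustration of how the two components combine, the sketch below uses the paper's headline figures where available; the remaining inputs (misrouted fraction, leaver headcount, average salary) are hypothetical placeholders, not the paper's parameters, so the printed total will not reproduce $11.76M exactly.

```python
# Stylized cost model (illustrative assumptions throughout; the
# paper's exact parameterization is not reproduced here).
BONUS_POOL = 10_000_000      # $10M pool, as in the paper
ERROR_RATE = 0.494           # headline misclassification rate

# Direct misallocation: share of the pool paid against wrong tiers.
# The fraction of dollars misrouted per misclassified employee is an
# assumed free parameter here.
misrouted_fraction = 0.9     # hypothetical
misallocation = BONUS_POOL * ERROR_RATE * misrouted_fraction

# Attrition: underranked high performers who quit, costed at a
# replacement multiple of salary (1.5-2x for senior technical roles,
# per the paper). Headcount and salary are hypothetical.
n_underranked_leavers = 30          # hypothetical
avg_salary = 150_000                # hypothetical
replacement_multiple = 1.75         # midpoint of the 1.5-2x range
attrition = n_underranked_leavers * avg_salary * replacement_multiple

total = misallocation + attrition
print(f"misallocation ${misallocation / 1e6:.2f}M, "
      f"attrition ${attrition / 1e6:.2f}M, total ${total / 1e6:.2f}M")
```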
The BTL Mirror Supplement
The constructive section (absorbing Paper 19) presents the Bradley-Terry-Luce (BTL) Mirror Supplement as a practical alternative. Rather than replacing forced ranking entirely (which may face institutional resistance), the BTL supplement adds sparse cross-team pairwise comparisons to the existing within-team ranking process.
The BTL model estimates a global performance parameter for each employee based on the outcomes of pairwise comparisons (an estimation sketch appears at the end of this section). At k = 5 comparisons per employee (requiring approximately 15 minutes of manager time per cycle), the supplement achieves:
- 22% reduction in total organizational cost: From $11.76M to $9.17M
- 34% reduction in Tier 1 (bottom tier) misclassification: The largest improvement occurs in the highest-stakes decision -- who gets terminated
- Diminishing returns beyond k = 7: Additional comparisons provide marginal improvement, making k = 5 the efficiency-optimal implementation
The BTL supplement works because it provides a global reference frame that within-team rankings lack. It does not require full cross-team calibration (which introduces errors of its own) but instead uses sparse sampling to anchor local rankings to a global scale.
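The estimation step itself is standard: given sparse win/loss outcomes, BTL strengths can be fit with the classic minorization-maximization (MM) iteration for Bradley-Terry models. A minimal sketch follows; the setup (n, k, latent skills, and the half-win smoothing) is an illustrative assumption, not the paper's implementation.

```python
# Minimal sketch of the BTL estimation step: recover global strengths
# from k sparse pairwise comparisons per employee via the standard MM
# iteration for Bradley-Terry models. Setup values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 5                           # employees, comparisons each
skill = rng.normal(size=n)              # latent performance

# Each employee faces k random others; the winner is drawn from the
# BTL probability exp(s_i) / (exp(s_i) + exp(s_j)).
n_ij = np.zeros((n, n))                 # comparison counts
wins = np.full(n, 0.5)                  # half-win pseudo-count keeps
                                        # all-loss employees positive
for i in range(n):
    others = rng.choice([x for x in range(n) if x != i],
                        size=k, replace=False)
    for j in others:
        p_i = 1.0 / (1.0 + np.exp(skill[j] - skill[i]))
        w, l = (i, j) if rng.random() < p_i else (j, i)
        wins[w] += 1
        n_ij[w, l] += 1
        n_ij[l, w] += 1

# MM update: pi_i <- W_i / sum_j n_ij / (pi_i + pi_j), renormalized.
pi = np.ones(n)
for _ in range(200):
    pi = wins / (n_ij / (pi[:, None] + pi[None, :])).sum(axis=1)
    pi /= pi.mean()

# The recovered strengths should rank-correlate strongly with skill.
r_pi, r_sk = pi.argsort().argsort(), skill.argsort().argsort()
print(round(float(np.corrcoef(r_pi, r_sk)[0, 1]), 3))
```

The half-win smoothing is a common regularization for players with no recorded wins; without it, the iteration can collapse their strengths to zero.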
Engagement with HR Analytics Literature
The paper engages directly with three streams in the HR analytics literature. Aguinis and O'Boyle (2014) demonstrated that individual performance follows a power-law (Paretian) distribution rather than a normal distribution, which the simulation confirms produces the highest forced-ranking error rates. Scullen, Bergey, and Aiman-Smith (2005) estimated that forced ranking systems misclassify approximately 50% of employees, a finding this simulation replicates with greater precision and across a wider parameter space. Murphy (2020) argued that performance ratings reflect rater effects more than ratee performance, which corresponds to the noise parameter in the simulation and explains why noise amplification is the single most damaging parameter to forced-ranking accuracy.
Key References
Tournament-Based Performance Evaluation and Systematic Misallocation. arXiv preprint. arxiv.org/abs/2512.06583
Aguinis, H., & O'Boyle, E. (2014). Star Performers in Twenty-First Century Organizations. Personnel Psychology, 67(2), 313-350.
Scullen, S. E., Bergey, P. K., & Aiman-Smith, L. (2005). Forced Distribution Rating Systems and the Improvement of Workforce Potential. Personnel Psychology, 58(1), 1-32.
Murphy, K. R. (2020). Performance Evaluation Will Not Die, but It Should. Human Resource Management Journal, 30(1), 13-31.
Bradley, R. A., & Terry, M. E. (1952). Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons. Biometrika, 39(3/4), 324-345.
Lazear, E. P., & Rosen, S. (1981). Rank-Order Tournaments as Optimum Labor Contracts. Journal of Political Economy, 89(5), 841-864.