Bayeselo Ordo computer chess
Chess engine rating lists look simple on the surface. A table displays engine names, Elo values, scores, and positions in the ranking. Readers may then assume that the list is a direct and neutral reflection of engine strength. In practice, rating lists are the result of a method, and the method matters. The same PGN dataset can be processed in different ways, and different tools can embody different modelling assumptions, output conventions, and practical workflows.
For readers of chess engines ratings lists, one of the most useful methodological distinctions is the difference between Bayeselo and Ordo. Both are widely known names in computer chess rating work. Both are used to turn game results into interpretable rating outputs. Both can help tournament organisers, engine testers, and readers convert raw PGN archives into structured performance information. But they are not identical tools, and they should not be discussed as if they were interchangeable in every detail.
This article explains the role of Bayeselo Ordo computer chess methodology without claiming that one is universally superior. That is an important restraint. There is no responsible shortcut in which one can simply declare a permanent winner between Bayeselo and Ordo in all contexts. The more useful question is narrower: what does each tool do, what kind of assumptions does it make visible, and how should readers interpret rating lists built with one or the other?
Defining the topic precisely
Bayeselo and Ordo are both software tools used to calculate ratings from game results, especially in computer chess. Bayeselo, by Rémi Coulom, is explicitly described as a freeware tool that can read a PGN file and produce a rating list. Its public documentation also explains that the program estimates ratings through a Bayesian framing and maximum-likelihood procedures, while explicitly modelling win, draw, and loss probabilities as well as first-move advantage. (remi-coulom.fr)
Ordo, by Miguel A. Ballicora, is described in its repository as a program designed to calculate ratings for chess engines or players. The README states that it has a concept similar to Elo, but “with a different model and algorithm,” and that it calculates ratings by considering all results at once in order to preserve consistency across the pool. It takes PGN as input and can output ratings in text or CSV form, while also allowing the user to set an average pool rating or anchor a particular player to a chosen value. (GitHub)
At the broad ecosystem level, Chessprogramming’s summary of engine rating lists is useful because it shows that computer chess has long contained multiple independent rating environments, with different time controls and different institutional practices. That same overview also notes that TCEC’s official ratings in archive mode are calculated using Ordo. This is a reminder that the methodological choice is not merely theoretical; it affects public-facing engine lists used by real readers. (chessprogramming.org)
Why the topic matters for IJCCRL readers
This topic matters because a rating list is not just a collection of numbers. It is an interpretation layer placed on top of game results. If readers do not understand the method used to generate the list, they may over-interpret precision, understate uncertainty, or assume comparability where comparability is limited.
For IJCCRL, this matters at several levels. First, readers looking at current engine ratings need to understand that the displayed order is not produced by nature. It is produced by a rating workflow. Second, when readers compare current lists with older events, the method becomes part of the historical record. A rating generated by one tool under one set of assumptions is not necessarily directly equivalent to another rating generated by another tool under another set of assumptions. Third, when historical tournament material is moved into the historical archive, the archive becomes more valuable when its methodological background is clear. Method transparency improves trust. (chessprogramming.org)
In other words, Bayeselo and Ordo matter because they shape how raw results become readable evidence. The difference is not just technical decoration. It affects how responsibly a reader can interpret a ranking table.
What Bayeselo does
Bayeselo is one of the best-known rating tools in computer chess. Its public homepage explains that it reads PGN records and produces a rating list. It also exposes a usage workflow: users load PGN data, enter the Elo subsystem, run the MM procedure, compute distributions, and print ratings. That practical transparency is one reason it became familiar to engine testers. (Rémi Coulom)
The theoretical section is especially important. Bayeselo does more than apply a simple Elo update rule. Rémi Coulom’s explanation states that, to compute the posterior over ratings, the program needs not only an expected score formula but also explicit probabilities for white win, black win, and draw as functions of rating difference. The documentation then defines those probabilities and introduces two notable parameters: one related to first-move advantage and one related to draw tendency. The default values cited on the page were estimated from a large WBEC game sample, and the page gives confidence intervals for those estimates. (Rémi Coulom)
This point is methodologically important. Bayeselo is not only trying to order engines; it is trying to model the result-generating process with more structure than a naive score-only transformation. Its documentation further states that Bayeselo finds maximum-likelihood ratings using a minorization-maximization algorithm. That gives the tool a recognisable statistical identity and helps explain why its outputs may differ from more naive systems. (Rémi Coulom)
Bayeselo’s documentation also emphasises two practical features that are highly relevant for readers. First, it takes colour into account. Coulom explicitly argues that playing White in chess brings an advantage worth roughly 33 Elo points, and Bayeselo models that instead of ignoring it. Second, Bayeselo uses a prior distribution that favours ratings being closer together unless the data justify larger gaps. The page illustrates this with examples showing that a 10–0 score and a 1–0 score should not imply the same evidential strength, even if both are perfect scores in raw percentage terms. In short, Bayeselo is designed to be cautious about large inferred rating separations when the number of games is small. (Rémi Coulom)
That caution is valuable in computer chess, where some readers are tempted to overreact to small samples. Bayeselo’s prior-based behaviour can be interpreted as a methodological attempt to resist overstatement. But that does not automatically make it the right answer for every project. It simply tells us what kind of problem Bayeselo is trying to solve.
What Ordo does
Ordo occupies a different but equally important place in computer chess ratings. Its repository README describes it as a program for rating chess engines and other games, and it explicitly states that although the concept is similar to Elo, the program uses a different model and algorithm. The README adds that Ordo “keeps consistency among ratings because it calculates them considering all results at once.” That sentence is perhaps the clearest introductory description of Ordo’s practical philosophy. (GitHub)
From a workflow perspective, Ordo is straightforward. The user points the program at one or more PGN files, and Ordo generates a ranking. It can output plain text or CSV. It also allows the user to set a target average rating with -a, and to fix the rating of a particular engine as an anchor with -A. These are not minor conveniences. They matter because rating publication often requires a stable reporting convention. A pool may need to be centred on a certain average, or linked to a known reference point so that a published table remains readable over time. (GitHub)
Ordo’s role in public computer chess is also visible indirectly through Chessprogramming, which notes that TCEC’s official ratings are calculated using Ordo. That does not prove universal superiority. It does show that Ordo is trusted in at least one major public engine ecosystem. In methodological writing, this is a relevant fact because it situates Ordo not merely as a theoretical alternative, but as a practical tool that has been used for published competition ratings. (ChessProgramming)
What readers should take from this is not that Ordo must replace Bayeselo, but that Ordo represents a distinct and serious approach to pool-wide rating calculation. Where Bayeselo foregrounds Bayesian language, priors, draw modelling, and explicit colour treatment in its public explanation, Ordo foregrounds consistency across all results and practical usability in converting PGN datasets into coherent tables. (Rémi Coulom)
Why the two tools can produce different-looking outputs
The fact that Bayeselo and Ordo differ in model and algorithm means that the same PGN dataset need not yield identical numerical tables. That should not be surprising. When a rating tool handles priors, colour advantage, draw treatment, scaling, anchoring, or pool constraints differently, the resulting ratings can shift.
Bayeselo’s own site explicitly highlights multiple differences between Bayeselo and older systems such as Elostat. For example, it argues that Bayeselo takes colour into consideration, uses a prior to avoid overrating one-off results, and behaves more reasonably in cases where opponent ratings are far apart. Those examples are not directly comparisons with Ordo, but they show the general principle: rating methods are not neutral containers. Their assumptions affect the output. (Rémi Coulom)
Ordo, meanwhile, emphasises that it calculates ratings by considering all results together and keeping consistency across the rating pool. This tells the reader that Ordo is also not a trivial game-by-game counter. It is a full-pool estimation method. Different estimation philosophies can both be defensible, yet still lead to differences in rank order, rating gaps, and confidence in marginal separations. (GitHub)
This is precisely why users should resist simplistic claims such as “Tool X is correct, Tool Y is wrong.” A more careful statement is that each tool expresses a methodology, and the methodology should be matched to the use case.
Practical guidance for readers
For IJCCRL readers, the most useful practical guidance is interpretive rather than partisan.
First, always identify which rating tool was used. If a published table was produced with Bayeselo, that signals one modelling framework. If it was produced with Ordo, that signals another. Readers do not need to become statisticians, but they should know the label attached to the numbers.
Second, never treat the numeric value in isolation. A rating number gains meaning only when combined with the tournament conditions, time control, hardware pool, opening policy, and game count. Chessprogramming’s overview of engine rating lists reminds us that rating ecosystems differ widely by conditions and by institutional context. (ChessProgramming)
Third, remember that anchoring and scaling choices matter. Ordo’s documented options for setting average rating and anchoring a player are a reminder that published lists often need a reference frame. Bayeselo likewise includes commands such as offset and scale, which show that presentation and calibration choices are part of the workflow. A rating list is therefore not just an invisible calculation; it is also a reporting decision. (Rémi Coulom)
Fourth, be cautious with small differences. Bayeselo’s examples show how strongly inference can depend on the quantity and structure of the data. A one-game result and a ten-game result are not equally informative. Even when a table prints exact-looking numbers, the true evidential gap may be small. (Rémi Coulom)
Fifth, do not confuse rating purpose with event purpose. A rating table summarises estimated performance in a pool. A tournament bracket or final match determines an event winner. Those are related, but not identical, outputs. A site like IJCCRL should therefore keep ratings, events, downloads, and archive surfaces methodologically distinct.
Methodological limits
It is important to state the limits clearly.
Neither Bayeselo nor Ordo eliminates the need for good data. If the PGN dataset is small, unbalanced, mixed across unlike conditions, or contaminated by inconsistent event design, the resulting ratings will inherit those limits. A sophisticated rating tool cannot fully compensate for weak tournament design.
Nor should either tool be interpreted as a universal truth machine. Bayeselo’s public page makes a strong case for its own design choices, but even there Coulom explicitly says he does not claim the tool is perfect and welcomes criticism. That humility is methodologically healthy. (Rémi Coulom)
Similarly, the existence of Ordo in serious public use does not prove that every Ordo output is automatically optimal. Ordo’s documentation tells us what the program does and how it can be used; it does not entitle the reader to collapse all methodological debate into a single slogan. (GitHub)
The responsible conclusion is therefore comparative rather than absolutist. Bayeselo and Ordo are both legitimate parts of the computer chess rating toolbox. Their outputs should be interpreted with awareness of model choice, input quality, and publication context.
Bayeselo, Ordo, and responsible publication at IJCCRL
For IJCCRL, the practical lesson is straightforward. When publishing a rating list, say what generated it. When publishing event results, distinguish event victory from rating position. When publishing historical material, preserve the methodological note so that future readers know what the numbers mean.
This approach improves interpretability in both current and archival contexts. A reader looking at a live or recent rating list should be able to tell whether the numbers come from Bayeselo or Ordo. A reader looking back through older seasons should be able to understand whether a rating series remained methodologically stable or changed across cycles. That is one reason why archive hygiene matters. Ratings without context are weaker as historical evidence than ratings with clear methodological disclosure.
In a serious computer chess environment, transparent method is part of the editorial product.
Conclusion
Bayeselo and Ordo both play important roles in computer chess ratings, but they should not be reduced to a simplistic contest of universal superiority. Bayeselo is notable for its explicit statistical framing, its treatment of colour and draws, and its caution about large inferences from small samples. Ordo is notable for its full-pool consistency approach, practical PGN workflow, and established use in public engine-rating contexts.
For readers, the best question is not “Which one wins forever?” but “Which method was used here, and what does that mean for interpretation?” That question leads to better reading, better publishing, and better historical record-keeping.
A rating list becomes more trustworthy not when it claims infallibility, but when it makes its method visible.
Sources / References
Rémi Coulom, Bayesian Elo Rating.
Official Bayeselo page describing the tool, usage, theory, prior, colour advantage, draw modelling, and MM-based maximum-likelihood estimation. (Rémi Coulom)
Miguel A. Ballicora, Ordo.
Official repository README describing Ordo as a rating program for chess engines or players, with PGN input, pool-wide calculation, text/CSV output, average setting, and anchored reference options. (GitHub)
Engine Rating Lists — Chessprogramming Wiki.
Used for high-level context on independent engine rating lists and for the note that TCEC official ratings in archive mode are calculated using Ordo. (ChessProgramming)

Jorge Ruiz Centelles
Filólogo y amante de la antropología social africana
