1 February 2024

We Need to Talk About FPS Metrics Reporting... (Part 2)



There's a well-known saying: "There are lies, damned lies, and statistics...". The implication is that the "statistics" in question are another, worse form of lie - one somehow obfuscated from the receiver of the information.

We also have multiple well-known sayings that revolve around the idea that "you can make the statistics/data say anything you want". It seems readily apparent that people, in general, do not like or trust "the statistics".

I experience this, in my own way, in my day-to-day work. Scientists are currently not the most trusted of individuals - for whatever reason - and one of those reasons, in both cases, is a lack of understanding of the results of data analysis on the part of the consumer, both within and outside of scientific circles.

In the same way people say "science is hard", people say "statistics is hard"... and for good reason - though it might not be the specific reason that immediately springs to mind!

Statistics is not that difficult once you know what you are doing (at least in my opinion). The difficult part is knowing which statistical test to apply when and where. Yes, the difficulty, as when designing scientific experiments, is understanding the context, limitations and biases of what and how you wish to test.

This is why there are many statistical tests where the number of data points needs to be below or above a certain limit; why it is important to know the relationship between the individual data points and the set as a whole; and how the interpretation of the result of the analysis might be changed based on myriad factors.

Hence, we come to today's topic for discussion: hardware performance testing in games!

Last time, I attempted to communicate the shortfalls and incorrect analysis being performed in the industry at large. Admittedly, I was unsuccessful in many ways and was roundly dismissed by most parties...

Today, I will try a different tack.


Just an example of an average and percentile result... (Tom's Hardware)



Porkies...



Let's talk about statistical analysis! Yes, I can see you falling asleep already but this is central to my point today. One of the innovations discussed last time was the application of very light statistical analysis tools to frametime data.

We analyse the average framerate, along with the lowest framerate and/or the percentile lows (as picked by the specific outlet). I tried to point out that the method chosen to do this is quite literally wrong. However, I came at it from the perspective of a reviewer trying to pull together the data to make a story - who may or may not have any more statistical training than I do. 

This time I want to come at the problem from the point of view of statistical logic. 

First, let's acknowledge what already works: average framerate. This metric is well understood and it is being applied correctly. But it was the desire to more clearly understand the experience of playing games on specific hardware that caused the industry to move to look at the lowest framerates, as these can have an outsized impact on the experience in the form of stutters or incorrect frame pacing.

So, how was this approached?

The industry, as a whole (at least as far as I can tell), has done this by doing two things:
  • Assuming that the dataset of frametimes is normally distributed (or close to it). 
  • Directly converting the individual frametimes into a framerate value (or fps).

For example, to obtain the 1% low fps, the reviewer interpreting the data gathered during a benchmark will form a distribution of the dataset, take the 99th percentile (corresponding to the longest frametimes), and then convert that value into a framerate (fps) metric.
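For the curious, here's roughly what that calculation looks like in code - a minimal sketch in Python with invented numbers, reflecting the common method as I understand it rather than any specific outlet's script:

```python
# A minimal sketch of the common "1% low" method (invented frametimes).
import numpy as np

frametimes_ms = np.array([16.6, 16.8, 17.1, 16.5, 33.4,
                          16.7, 16.6, 45.2, 16.9, 16.6])

# Average framerate: total frames divided by total elapsed time.
average_fps = len(frametimes_ms) / (frametimes_ms.sum() / 1000.0)

# The common method: take the 99th percentile of the frametimes
# (the slowest 1%) and convert that single value into an fps figure.
slow_frametime_ms = np.percentile(frametimes_ms, 99)
one_percent_low_fps = 1000.0 / slow_frametime_ms

print(f"Average: {average_fps:.1f} fps | 1% low: {one_percent_low_fps:.1f} fps")
```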

There are two issues with what has been done because frametimes are temporal data:
  • The order, or sequence, of the frames within the data is important.
  • A frametime is not a framerate.

By taking the frametimes and rearranging them so that similar values sit next to one another to form a distribution, you destroy the relationship between each frame and the frames that precede and follow it in the sequence.
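You can demonstrate this to yourself quite easily: shuffle the same frametimes into any order at all and the percentile result is unchanged. A toy illustration with invented data:

```python
# Percentiles are blind to ordering: any shuffle of the same
# frametimes yields an identical "1% low" value.
import numpy as np

rng = np.random.default_rng(42)
frametimes_ms = rng.uniform(15.0, 50.0, size=1000)  # invented data
shuffled = rng.permutation(frametimes_ms)

# Same values, completely different sequence -> identical percentile.
assert np.percentile(frametimes_ms, 99) == np.percentile(shuffled, 99)
```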

For anyone who is familiar with a certain level of mathematics, this is like confusing the application of permutation (nPr) and combination (nCr) when attempting to work out the number of possible arrangements of a set of data.

Now, many people will argue that it doesn't matter and that you just want to see the worst performance in any given benchmark. However, this is where we hit the second problem - a frametime IS NOT a framerate.

If we take the analogy of a car journey, the framerate is the averaged speed over a period of time. If you divide the distance travelled from home to work by the time you took to make the journey, you will get your average speed - aka the average framerate.

However, this is where that analogy breaks down, because an individual frametime is not like a point speed value. Converting a single frametime into fps is really an extrapolation: it tells you how much distance you would cover per unit time if you continued on at that same frametime value (e.g. a lone 33.3 ms frame "converts" to 30 fps only if every frame took 33.3 ms). Unfortunately, that is almost never the case in situations where we have unlocked framerates - the frametimes will vary, potentially by quite a lot.

In this analogy, there is no equivalent for frametimes. The closest we can get* is the difference between sequential frames, which is like the derivative of the speed value - i.e. the magnitude of your acceleration.

*That I can conceptualise!

A frametime is the time to deliver one frame. The only analogy I can come up with is related to the speed of light. The speed of light is a constant. However, that constant changes depending on which medium the photon* is passing through. Thus, a frametime is like the distance travelled by a photon where the medium it is passing through changes for each frame presented.

*Let’s assume it’s a photon.

In this picture, the speed of light in a vacuum is the upper limit set by the game engine, but every frame has to travel through a different medium. Sometimes it's glass, sometimes it's a gas, sometimes it's the human body. Occasionally, the medium for sequential frames is the same or a very similar material - but often it is not.

We can still calculate the average distance over time, as we did in the car analogy (i.e. the framerate), but we end up with these individual instances (frames/photons) that make up the whole, each travelling a different distance - each at "the speed of light", only that speed is different for every one of them.

Each point speed is not an average - it is a constant (one that changes from frame to frame). Thus, you cannot turn that value from a constant into an average. That is, you cannot directly convert a frametime into a framerate (fps) because (in this analogy) the photon will not be travelling the same distance per unit time.


Previously, I tried to show visually that the framerate is a moving average... 
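(If you want to play with that idea yourself, here's a rough sketch of a framerate computed as a moving average - the one-second window is my arbitrary choice for illustration:)

```python
# A sketch of framerate as a moving average: count the frames that
# completed within a trailing one-second window.
import numpy as np

def rolling_fps(frametimes_ms, window_s=1.0):
    # Completion timestamp of each frame, in seconds.
    completion_s = np.cumsum(frametimes_ms) / 1000.0
    fps = []
    for t in completion_s:
        # Frames whose completion falls inside the trailing window.
        in_window = np.sum((completion_s > t - window_s) & (completion_s <= t))
        fps.append(in_window / window_s)
    return np.array(fps)
```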



Flawed analogies...



Now we can move back to the first point - the idea that the order of frames is important.

Let’s make another analogy! (They’re fun!)

If we collected the data on the height of all men in Sweden and we wanted to know the average height of the male population, we’d sum all the values and then divide by the number of data points.

Now, if we want to find out the 1% lowest height in the population, we’d arrange that same data in a distribution and find that percentile (even if it landed between two data points) and report the value.

There is nothing wrong with this application of statistics.

The reason why there’s nothing wrong with it is because none of the data is related to the other - there are no dependencies. The height of male #1 has no relationship to the height of male #322, or #4536 other than we will measure them using the same units.

For frametimes, we actually have the same situation if we're just looking at the data in isolation: there is no relationship between frame #1 and frame #8764. However, in reviewing performance we are doing something very important - we are assessing the user experience via the relative performance of the hardware in question over a period of time (the benchmark).

Thus it becomes important which frame follows which frame and where in the benchmark the frame is located.


This is another issue with focussing on the percentile data: it does not show you how impactful that value actually is!

Let’s assume for a moment that we accept the percentile data is an acceptable way to analyse the performance of PC gaming hardware and that directly converting frametimes into framerates is also acceptable:

If we keep all other hardware variables the same, and GPU 1 gives you a 1% low of 31.5 fps while GPU 2 gives you a 1% low of 35.6 fps, you'd say that GPU 2 is better, right?

Well, what if I show you the frametime graphs of those two benchmark runs?


The percentile lows fail to capture the experience and especially the magnitude of the experience... GPU1 (left), GPU2 (right)


Would you still say GPU 2 is better?

What if we have another situation where, this time, we're doing CPU comparisons: CPU 1 gives you a 1% low of 35.4 fps whereas CPU 2 gives you 34.6 fps. Which is the better CPU?


CPU1 (left) gives you a large hang when passing through a loading area. CPU2 (right) has stutters as the enemy AI engages with you during a fight...

Well, that's the thing! CPU2 is stuttering less in terms of magnitude but at an important part of the gameplay - combat. CPU1 has a much worse stutter but it's during the loading of a new area - which won't kill you...

From these simplistic graphs, I hope you can see that these percentile values are almost meaningless because the sequence of events is important when playing a game! All context has been stripped from the data by applying an incorrect statistical interpretation...
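To make that concrete, here's a toy construction (all values invented) in which two runs report an identical 1% low despite behaving very differently over time:

```python
# Two runs with the *same* 1% low but very different behaviour
# over time (all numbers invented for illustration).
import numpy as np

baseline = np.full(1000, 16.7)   # a steady ~60 fps
run_a = baseline.copy()
run_a[::100] = 40.0              # ten slow frames scattered throughout
run_b = baseline.copy()
run_b[500:510] = 40.0            # the same ten slow frames, in one cluster

for name, run in (("Run A", run_a), ("Run B", run_b)):
    low_fps = 1000.0 / np.percentile(run, 99)
    print(f"{name}: 1% low = {low_fps:.2f} fps")

# Identical 1% lows - yet Run A stutters regularly across the whole
# benchmark while Run B delivers one long, very noticeable hitch.
```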

Worse still, any outlet using 0.5% and 1% percentile lows will be reporting values which will most likely not represent the real order of performance between products (ignoring the fact that what they are doing is wrong in the first place!).

One of my suggestions was to look at the differential between sequential frames and then define a limit to see how many excursions occurred...
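A rough sketch of what that might look like - the 8 ms limit here is just an arbitrary example, not a recommendation:

```python
# Count "excursions": frame-to-frame differentials whose magnitude
# exceeds a chosen limit (the 8 ms limit is an arbitrary example).
import numpy as np

def count_excursions(frametimes_ms, limit_ms=8.0):
    deltas = np.diff(frametimes_ms)  # change between consecutive frames
    return int(np.sum(np.abs(deltas) > limit_ms))

run = np.array([16.7, 16.6, 16.8, 41.2, 16.7, 16.9, 35.0, 16.6])
print(count_excursions(run))  # -> 4 (each spike is entered and exited)
```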



Finale...


And that, I think, does it for today. In the previous post, I went over what I thought could be an improvement in reviewing procedure. Maybe next time, I will come back to that and see if I can make it even more user-friendly.

Thanks for reading!

2 comments:

DavidB said...

Yeah, the challenge is reviewers feel compelled to present some graph/chart showing how one compares to others, is the reviewed product "faster" or "better" than others. If they don't, they'll be roasted in the comments for not providing an understandable comparison the "average" (i.e. not techy nerds) reader/viewer can comprehend to make an A versus B buying decision.

I've read the other article too, and I'm not sure there's a solution that satisfies the "scientist" while still being digestible by the mainstream. But that's what you're after, which I applaud.

Duoae said...

Yeah, I know what you mean. The percentile lows are easily understandable and well-ingrained as a concept.

But I think that the only reason they're accepted by the mass market is because they're ingrained, not because they're the best.
I believe there must be a way to put better data into a simple bar chart but, as you say, still working on it! 😅