Can a Question Bank Predict Your Step 2 CK Score? NBME Forms, UWSAs and the Calibration Problem

Behind every score-predictor thread is one question: can practice tell me my real Step 2 CK score? The answer is yes, but only some tools predict well, and confusing them leads people badly astray. Official NBME self-assessments predict far better than a question-bank percentage, every prediction is a range rather than an exact number, and no tool without its own outcome data should claim a validated prediction. Here is the honest hierarchy, the calibration problem underneath it, and what an adaptive engine can and cannot yet claim.

Key takeaways

NBME self-assessments and the Free 120 predict best, because they share the exam's item family and scale.
UWorld self-assessments come next; raw question-bank percentage predicts worst.
Practice percentages mislead because of pool drift, resets, mode effects, and selection bias.
Measurement error means every prediction is a band, roughly plus or minus 6 to 8 points.
An adaptive engine can honestly target weaknesses and retention, but should not claim validated prediction without a study.

The question behind every score-predictor thread

Every forum has the same anxious post: my scores are all over the place, what will I actually get? It is the right question, because Step 2 CK now carries the quantitative weight in residency screening, but it is usually answered with the wrong tool. People average their question-bank percentage, or fixate on one mock, and treat the result as a forecast. The useful move is to know which practice measures actually predict, and how much uncertainty sits around any of them.

The evidence hierarchy

Not all practice scores are equal predictors, and they sort into a clear order. At the top sit the official NBME self-assessments and the Free 120, because they use the same item family and the same scale as the real exam, so they translate most directly and are the backbone of any prediction. Next come the UWorld self-assessments, which correlate strongly but carry their own bias, tending to run slightly optimistic. At the bottom sits your raw question-bank percentage, which correlates only loosely with the real score. Weight your prediction accordingly: NBME forms first, UWSAs as a check, and your cumulative percentage as background, not forecast.

Why percentages mislead

A question-bank percentage is a poor predictor for structural reasons, not because the bank is bad. The item pool drifts in difficulty as you progress and as content updates, so early and late percentages are not comparable. Resetting the bank or doing a second pass inflates the number above your true first-encounter performance. Untimed tutor mode produces higher scores than timed, exam-like blocks. And the percentages people quote online suffer from selection bias, since those who post tend to be those who did well, skewing the "average" you compare yourself to. Each effect pushes the number away from what it seems to say.

Measurement error: bands, not points

Even the best predictor gives you a range. Step 2 CK has a standard error of measurement of about 6 points and a standard error of estimate of about 8 points for prediction, which means your true performance sits within a band, and a practice test forecasts a spread rather than an exact score. A predicted 248 realistically means somewhere between about 240 and 256, not 248 exactly. Treating a single mock as a precise verdict, or agonizing over a 3-point gap, is reading noise as signal. Think in bands, take several measures, and watch the trend.
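The band arithmetic is simple enough to sketch. The snippet below is a minimal illustration, not an official scoring method: it takes the roughly 6-point and 8-point figures quoted above as given, and the function names are made up for this example. The noise check additionally assumes the two scores carry independent errors, so the standard error of their difference is the square root of 2 times the measurement error.

```python
import math

# Figures quoted in the text; treat them as approximate.
SEM_POINTS = 6.0  # standard error of measurement
SEE_POINTS = 8.0  # standard error of estimate for prediction


def prediction_band(predicted: float, see: float = SEE_POINTS) -> tuple[float, float]:
    """Return a (low, high) band of one standard error of estimate around a prediction."""
    return (predicted - see, predicted + see)


def gap_is_noise(score_a: float, score_b: float, sem: float = SEM_POINTS) -> bool:
    """Return True if two scores differ by no more than one standard error of a difference.

    Assumption: the two measurement errors are independent, so the standard
    error of the difference is sqrt(2) * sem (about 8.5 points for sem = 6).
    """
    return abs(score_a - score_b) <= math.sqrt(2) * sem


low, high = prediction_band(248)
print(f"Predicted 248 -> band {low:.0f} to {high:.0f}")
print("3-point gap is noise:", gap_is_noise(245, 248))
```

A band of one standard error is a rough guide rather than a guarantee; real scores can and do fall outside it.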

Calibration as a personal skill

Prediction is not only about tools; it is about knowing how well you know. Calibration is the match between your confidence and your accuracy, and you can track it: when you answer, note how sure you were, then compare against whether you were right. The highest-yield review targets are your overconfident misses, the questions you were sure of and got wrong, because they reveal false knowledge you would otherwise carry into the exam untouched. A well-calibrated candidate not only scores better but predicts better, because they can feel the difference between knowing and guessing.
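Tracking this can be as simple as logging a confidence rating next to each answer. The sketch below is one hypothetical way to do it: the Answer record, the 0-to-1 confidence scale, and the 0.8 cutoff for "sure" are all assumptions chosen for illustration, not a prescribed scheme.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Answer:
    question_id: str
    confidence: float  # self-rated chance of being right, 0.0 to 1.0
    correct: bool


def calibration_table(answers: list[Answer]) -> dict[float, float]:
    """Map each stated confidence level to the accuracy actually observed at that level."""
    buckets: dict[float, list[bool]] = defaultdict(list)
    for a in answers:
        buckets[a.confidence].append(a.correct)
    return {conf: sum(results) / len(results) for conf, results in buckets.items()}


def overconfident_misses(answers: list[Answer], threshold: float = 0.8) -> list[str]:
    """Return ids of questions answered wrong despite confidence at or above the threshold."""
    return [a.question_id for a in answers if a.confidence >= threshold and not a.correct]


# Made-up log of six answers, purely for demonstration.
answers = [
    Answer("q1", 0.9, True),
    Answer("q2", 0.9, False),
    Answer("q3", 0.6, True),
    Answer("q4", 0.6, False),
    Answer("q5", 0.9, True),
    Answer("q6", 0.3, False),
]

for conf, acc in sorted(calibration_table(answers).items()):
    print(f"said {conf:.0%} sure -> right {acc:.0%} of the time")
print("review first:", overconfident_misses(answers))
```

If the observed accuracy at a confidence level sits well below the stated confidence, that level is where you are overconfident, and the listed question ids are the ones to review first.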

What an adaptive engine can and cannot honestly claim

Here is where we draw a line we intend to hold. An adaptive engine can honestly do two things today: target the concepts you keep missing, and schedule review so you retain them. What it should not claim, without its own outcome data, is a validated prediction of your Step 2 CK score, because a credible predictor requires a correlation study linking practice performance to real results, and we do not yet have one for this engine. So iatroX targets weaknesses and retention now, with free sample questions to try at iatroX, and we commit to publishing a correlation study when the data allows, rather than quoting a predicted score we cannot yet stand behind. The honesty is the point: use validated NBME forms to predict, and an adaptive engine to improve.

A benchmarking protocol you can copy

Put it together into a routine. Take an NBME form early for a baseline, then repeat NBME forms and a UWSA across your preparation under fixed, timed conditions, taking your most predictive forms in the final one to four weeks. Read every result as a band, weight NBME over UWSA over percentage, and average recent forms rather than trusting one. Track your calibration and hammer your overconfident misses. And use your question-bank percentage as a study signal, not a forecast. For the numbers behind the exam, see the 218 standard and the 250 mean, and for reading practice percentages specifically, what a good UWorld percentage means.
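The averaging step of this routine could be sketched as follows. Everything numeric here is an assumption for illustration: the 2-to-1 weighting of NBME over UWSA is an arbitrary stand-in for "weight NBME over UWSA", not a validated ratio, the 28-day window mirrors the final four weeks, and the scores are invented. Question-bank percentage is deliberately left out because it is a study signal, not a forecast.

```python
from dataclasses import dataclass

# Hypothetical weights: NBME counts double. Not derived from any study.
SOURCE_WEIGHTS = {"NBME": 2.0, "UWSA": 1.0}


@dataclass
class FormResult:
    source: str            # "NBME" or "UWSA"
    days_before_exam: int  # how long before test day the form was taken
    score: float           # three-digit score reported by the form


def recent_weighted_estimate(
    results: list[FormResult], window_days: int = 28, see: float = 8.0
) -> tuple[float, float, float]:
    """Return (low, center, high) from a weighted mean of forms taken within the window."""
    recent = [r for r in results if r.days_before_exam <= window_days]
    if not recent:
        raise ValueError("no forms taken within the window")
    total_weight = sum(SOURCE_WEIGHTS[r.source] for r in recent)
    center = sum(SOURCE_WEIGHTS[r.source] * r.score for r in recent) / total_weight
    # Keep the full single-form band rather than shrinking it for the average.
    return (center - see, center, center + see)


# Invented results: the 60-day-old baseline falls outside the window.
results = [
    FormResult("NBME", 60, 232),
    FormResult("NBME", 21, 244),
    FormResult("UWSA", 14, 252),
    FormResult("NBME", 7, 246),
]

low, center, high = recent_weighted_estimate(results)
print(f"estimate {center:.1f}, band {low:.1f} to {high:.1f}")
```

Keeping the band at the full 8 points is a deliberately conservative choice: averaging several forms reduces noise somewhat, but how much it narrows a real prediction is not something this sketch can claim.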

Frequently asked questions

What is the best predictor of my Step 2 CK score? The official NBME self-assessments and the Free 120, because they share the exam's item family and scale. UWorld self-assessments are a strong secondary check; your raw question-bank percentage is the weakest predictor.

Why is my question-bank percentage a bad predictor? Because of pool difficulty drift, reset and second-pass inflation, tutor-versus-timed mode effects, and selection bias in the scores people report online. Each pushes the number away from your true standing.

How accurate is any prediction? It is a band, not a point. With a standard error of measurement around 6 points and a standard error of estimate around 8, a prediction realistically spans about 6 to 8 points on either side, so use ranges and trends rather than single numbers.

What is calibration and why does it matter? Calibration is how well your confidence matches your accuracy. Tracking it lets you find overconfident misses, the questions you were sure of but got wrong, which are the highest-yield review targets because they reveal false knowledge.

Can iatroX predict my Step 2 CK score? Not yet, and it will not claim to without a correlation study. It targets your weaknesses and schedules retention now, and we commit to publishing a prediction study when the data supports it, rather than quoting a score we cannot stand behind.