Why your Apple Watch (and Garmin and Fitbit) lies about calories

You close the rings. Your Apple Watch says the workout burned 650 active calories. Dinner arrives, and you make a very reasonable bargain with yourself: you earned another 600, so eating them back should still leave the day intact.

Two weeks later, the scale has not moved.

The watch was not lying exactly. It sensed movement, measured your pulse, ran those inputs through a model, and gave you a number. The problem is that people treat the result like a receipt from a laboratory when it is closer to a weather forecast: useful, but not precise enough to settle a calorie deficit on its own.

The Stanford research

The cleanest illustration comes from a 2017 validation study by Shcherbina et al., led by researchers at Stanford. They tested seven popular wrist-worn devices while participants walked, ran, cycled, and sat at a desk, then compared the readings against criterion measures for heart rate and energy expenditure.

The split was stark:

For heart rate, most devices performed reasonably well, generally landing within about 5% of the reference measurement.
For energy expenditure, the best device's median error was 27% (Fitbit Surge).
The worst device's median error approached 93% (PulseOn).
None of the devices achieved an error in energy expenditure below 20%, far outside the roughly 5% band that most of them managed for heart rate.

That is the whole story in miniature. The watches were pretty good at detecting a physiological signal. They were much worse at translating that signal into calories. A later systematic review by Fuller et al. (2020) reached the same broad conclusion across a much larger literature base: steps and heart rate are generally more defensible wearable outputs, while energy expenditure remains the weakest major metric.
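It helps to see what a "median error" figure actually summarizes. The sketch below shows one common way to compute it: take the absolute percentage gap between each device reading and its reference reading, then take the median. The function name and the five paired readings are hypothetical, made up for illustration. They are not data from the study, and the study's exact error definition may differ in detail.

```python
from statistics import median

def median_abs_pct_error(device, reference):
    """Median of |device - reference| / reference, expressed as a percentage."""
    if len(device) != len(reference):
        raise ValueError("device and reference must have the same length")
    errors = [abs(d - r) / r * 100 for d, r in zip(device, reference)]
    return median(errors)

# Hypothetical paired readings in kcal for five sessions (NOT study data).
reference_kcal = [400, 520, 310, 650, 480]
device_kcal = [520, 430, 400, 830, 600]

print(round(median_abs_pct_error(device_kcal, reference_kcal), 1))  # 27.7
```

Because it is a median, half of the individual sessions miss by more than that figure. A 27% median error does not mean 27% is the worst case.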

Why heart rate is easy and calories are hard

Heart rate is close to a direct measurement. The sensor is not literally seeing your heart, but it is observing a physical signal tied tightly to each beat, usually through photoplethysmography at the wrist.

Calories are different. Your watch does not measure heat production. It estimates energy expenditure from a bundle of inputs such as heart rate, body weight, age, sex, movement pattern, and declared activity type. Then it maps those inputs onto a population-derived equation.

One influential example is the Keytel et al. (2005) approach, which predicts energy expenditure from heart rate while also accounting for variables such as sex, weight, age, and training status. It can work reasonably well in aggregate, especially during steady aerobic work, but it still assumes your relationship between heart rate and oxygen consumption resembles the calibration population.

Often, it does not.

A trained runner and an untrained beginner can hold the same heart rate while burning meaningfully different amounts of energy. Two people of the same body weight can differ in muscle mass, movement economy, stroke volume, and conditioning. Anxiety, caffeine, dehydration, heat, poor sleep, and illness can all push heart rate upward without matching calorie burn.

This is why a device can be good at heart rate and bad at calories at the same time. One number is observed. The other is inferred.
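To make "inferred" concrete, here is a minimal sketch of a Keytel-style estimate. The coefficients are the ones commonly quoted from Keytel et al. (2005) for the version of the model that does not use a measured VO2max, in kJ per minute. Check them against the paper before relying on them. The function name is illustrative, and nothing here claims to be what any particular watch runs.

```python
KJ_PER_KCAL = 4.184

def keytel_kcal_per_min(heart_rate, weight_kg, age, male):
    """Heart-rate-based energy estimate (coefficients as commonly quoted
    from Keytel et al. 2005; treat them as an assumption to verify)."""
    if male:
        kj_per_min = -55.0969 + 0.6309 * heart_rate + 0.1988 * weight_kg + 0.2017 * age
    else:
        kj_per_min = -20.4022 + 0.4472 * heart_rate - 0.1263 * weight_kg + 0.074 * age
    # The regression goes negative at low heart rates; clamp at zero.
    return max(kj_per_min, 0.0) / KJ_PER_KCAL

# Same person, same 45-minute session, but heart rate runs 10 bpm higher
# on a hot, under-slept, over-caffeinated day.
normal_day = 45 * keytel_kcal_per_min(150, 80, 35, male=True)
rough_day = 45 * keytel_kcal_per_min(160, 80, 35, male=True)

print(round(normal_day), round(rough_day))
```

The equation has no way to tell a harder effort from a hotter room. Any rise in heart rate is credited as extra calories, and two people with identical inputs get the identical answer regardless of how different their fitness is.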

The activity-type problem

Wearables usually do best when the activity looks like the activities they were built and trained to recognize.

Walking and running are the friendliest cases. They have repeatable movement patterns, GPS can help with speed and distance, and there is a long history of locomotion-specific equations.

Strength training is much messier. Heart rate rises during hard sets, but it does not scale with external workload the way it does during steady-state aerobic exercise. In practice, strength-training estimates are often 30-50% off.

Rucking and weighted-vest work create another blind spot. If you weigh 180 lb and carry 35 lb, the watch may know your body weight and walking speed, but it usually does not know about the extra load unless you supply it somehow. Even then, simple MET-style approaches often treat added mass too crudely. For load carriage, the physics are different enough that you are better served by specialized math; Pandolf vs MET explains why, and the rucking calculator gives you a better tool for the job.
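For a rough sense of what the pack adds, the sketch below uses the Pandolf load-carriage equation in its standard published form, with metabolic rate in watts, masses in kilograms, speed in meters per second, and grade in percent. The function name, the 1.5 m/s walking speed, the flat grade, and the terrain factor of 1.0 (firm, paved surface) are assumptions chosen for illustration. The equation itself has known limits, including slow walking and downhill grades.

```python
LB_TO_KG = 0.45359237
WATTS_TO_KCAL_PER_HOUR = 3600 / 4184  # 1 W = 1 J/s, 1 kcal = 4184 J

def pandolf_watts(body_kg, load_kg, speed_m_s, grade_pct=0.0, terrain=1.0):
    """Pandolf estimate of metabolic rate (watts) for loaded walking."""
    total_kg = body_kg + load_kg
    standing = 1.5 * body_kg
    load_penalty = 2.0 * total_kg * (load_kg / body_kg) ** 2
    locomotion = terrain * total_kg * (1.5 * speed_m_s ** 2 + 0.35 * speed_m_s * grade_pct)
    return standing + load_penalty + locomotion

body = 180 * LB_TO_KG   # the 180 lb walker from the text
pack = 35 * LB_TO_KG    # the 35 lb load from the text
speed = 1.5             # m/s, an assumed brisk walking pace

unloaded = pandolf_watts(body, 0.0, speed) * WATTS_TO_KCAL_PER_HOUR
loaded = pandolf_watts(body, pack, speed) * WATTS_TO_KCAL_PER_HOUR

print(f"unloaded: {unloaded:.0f} kcal/h, loaded: {loaded:.0f} kcal/h")
```

Under these assumptions the pack raises the estimate by roughly 15%. A watch that only knows body weight and pace has no term in which that difference can show up.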

HIIT creates a different problem. During intervals, heart rate lags effort on the way up and down, while post-exercise oxygen consumption adds a tail that is real but hard to estimate from wrist data alone. The more irregular the session becomes, the less confidence you should place in one calorie total.

The data the watch does not have

Even a good model is limited by missing inputs. Your watch still does not know several things that materially affect calorie burn:

Your true VO2max, unless you have had it measured in a lab and the device lets you enter the result.
Your fat-versus-carbohydrate oxidation mix at different intensities.
Your specific lean mass.
Your hydration status, stress load, sleep debt, and hormonal state on a given day.

Some devices estimate pieces of this. But estimates layered on top of estimates do not become direct measurement by accumulation.

The same measurement-honesty issue shows up on the food side too. A polished interface does not eliminate uncertainty; it can just make the uncertainty feel less visible. That is the central point in photo calorie counter accuracy, and it applies here as well.

Brand-by-brand patterns

Brand comparisons are tempting because people want to know whether one logo has solved the problem. The evidence is less tidy than that. Algorithms change, device generations differ, and independent studies often lag current products. Still, a few broad patterns are useful:

Apple Watch: generally strongest when the activity is locomotion-heavy, especially walking and running. It is less dependable for strength sessions and tends to miss the extra cost of loaded carriage.
Garmin: often a good fit for endurance athletes because long runs, bike rides, GPS, and structured aerobic training are central to the product line. Outside those domains, it still faces the same physiology problem as everyone else.
Fitbit: usually competent for steps and broad daily activity trends, with simpler consumer-facing calorie estimates that become less trustworthy when the activity departs from ordinary walking or running.
Whoop: particularly valued for heart-rate variability and recovery framing. Its calorie estimates still rely on heart-rate-derived modeling, so they inherit the same class of error.

The practical takeaway is not that one company is honest and another is not. It is that all of them are solving the same difficult inverse problem from incomplete data.

What numbers to trust from your wearable

There are wearable numbers worth using:

Steps: modern accelerometers are usually good enough for trend tracking.
Heart rate during steady activity: one of the most defensible outputs, especially with good contact and regular motion.
HRV trends: the absolute value can be noisy, but your own multi-day direction is often informative.
Sleep trends: sleep staging is rough, but total sleep time, regularity, and directional change can still be useful.

Those are all relative signals. They help you compare you with you.

What numbers not to trust as gospel

Other outputs deserve a wider error bar:

Active calorie burn: treat it as uncertain by roughly 25% in either direction, and sometimes worse depending on the workout.
Daily calorie total: this compounds exercise estimates with resting-metabolism assumptions.
VO2max: useful as a trend estimate, not the same thing as a direct metabolic-cart measurement.
Workout-specific calorie totals for non-locomotion activities: especially lifting, rucking, mixed-modal classes, and interval work.
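A few lines of arithmetic show what that error bar does to the "eat it back" bargain from the opening. This is a sketch, not a statistical model: it treats the 25% rule of thumb above as a simple symmetric band around the reported number. The function name is illustrative.

```python
def calorie_band(reported_kcal, rel_error=0.25):
    """Return the (low, high) range implied by a symmetric relative error."""
    return reported_kcal * (1 - rel_error), reported_kcal * (1 + rel_error)

reported = 650   # what the watch showed for the workout
eaten_back = 600 # what got added to dinner

low, high = calorie_band(reported)
print(low, high)          # 487.5 812.5

# If the true burn sat at the low end, the workout no longer covers the meal.
print(low - eaten_back)   # -112.5
```

At the low end of the band, eating back 600 calories turns a supposed 50-calorie cushion into a surplus of about 110 calories. Repeated over two weeks of workouts, that is enough to explain a scale that has not moved.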

The fix for calorie tracking

The fix is not to throw away the watch. The fix is to put it in the right place in the stack.

Use your wearable as a relative-progress meter, not an absolute judge of your calorie deficit. If today's run shows a similar heart-rate profile at a faster pace than last month, that tells you something useful. If your weekly step count rises, that tells you something useful.

For body-composition goals, calibrate against your actual weight trend over two to three weeks. Start with an estimate, observe what happens to the scale average, and then correct the estimate with reality. Why your TDEE calculator is wrong covers the same principle from the resting-metabolism side: formulas are starting hypotheses, not verdicts.
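The correction step is simple bookkeeping. The sketch below backs out an observed daily expenditure from average intake and the change in averaged scale weight. It assumes the common rule of thumb of about 7,700 kcal per kilogram of body-weight change, which is itself only an approximation. The function name and the example figures are hypothetical.

```python
KCAL_PER_KG = 7700  # rough rule of thumb, an assumption rather than a constant

def observed_tdee(avg_intake_kcal, start_avg_kg, end_avg_kg, days):
    """Estimate daily expenditure from intake and averaged weight change."""
    if days <= 0:
        raise ValueError("days must be positive")
    stored_kcal = (end_avg_kg - start_avg_kg) * KCAL_PER_KG  # negative when losing
    return avg_intake_kcal - stored_kcal / days

# Hypothetical example: 2,400 kcal/day eaten for 21 days while the
# weekly-average weight drifts from 82.0 kg to 81.6 kg.
estimate = observed_tdee(2400, 82.0, 81.6, 21)
print(round(estimate))
```

If the watch and the formula both implied a much larger deficit than the scale supports, this is the number to plan around. Use averaged weights for the start and end points rather than single weigh-ins, since one salty dinner can move a single reading by more than the trend moves in a week.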

For loaded walking, override generic wearable math with load-aware equations. Use the rucking calculator when pack weight matters, and read Pandolf vs MET for the model difference.

Most importantly, adopt a burndown view of progress. A daily calorie number is noisy. A multi-week trend is much harder to fool. How to read a burndown chart explains how to anchor the process in observed weight change instead of daily dashboard drama.
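One simple way to turn noisy daily weigh-ins into a trend is a least-squares line through them. The sketch below assumes one weigh-in per day with no gaps. The function name and the sample weights are hypothetical, and a rolling average would serve the same purpose.

```python
def trend_kg_per_week(daily_weights_kg):
    """Least-squares slope of daily weigh-ins, scaled to kg per week."""
    n = len(daily_weights_kg)
    if n < 2:
        raise ValueError("need at least two weigh-ins")
    mean_day = (n - 1) / 2
    mean_weight = sum(daily_weights_kg) / n
    sxy = sum((day - mean_day) * (w - mean_weight)
              for day, w in enumerate(daily_weights_kg))
    sxx = sum((day - mean_day) ** 2 for day in range(n))
    return 7 * sxy / sxx

# Hypothetical two weeks of morning weigh-ins (kg), bouncing day to day.
weights = [82.4, 82.9, 82.1, 82.6, 82.0, 82.5, 81.9,
           82.3, 81.7, 82.2, 81.8, 81.5, 82.0, 81.4]

print(f"{trend_kg_per_week(weights):+.2f} kg/week")
```

Several individual days in that sample are up on the day before, yet the fitted slope is clearly downward. That is the sense in which the trend is harder to fool than any single reading.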

The watch is a sensor, not a lab

Your wearable is impressive engineering. It can count steps, track a run, flag a heart-rate trend, and give you a coherent picture of your habits with almost no friction. That is valuable.

It is not indirect calorimetry. It is not a metabolic chamber. It is not qualified to tell you, with meal-planning precision, that you "earned" 600 extra calories tonight.

Treat the watch like a useful sensor and you will get useful information from it. Treat it like a calorimetry lab and it will help you rationalize your way out of the deficit you thought you had.

If you want the number that matters most, use the burndown chart. Over time, your weight trend is the only calorie estimate that has to reconcile with reality anyway.

Citations

Shcherbina, A., et al. (2017). "Accuracy in wrist-worn, sensor-based measurements of heart rate and energy expenditure in a diverse cohort." Journal of Personalized Medicine, 7(2), 3.
Keytel, L. R., et al. (2005). "Prediction of energy expenditure from heart rate monitoring during submaximal exercise." Journal of Sports Sciences, 23(3), 289-297.
Fuller, D., et al. (2020). "Reliability and validity of commercially available wearable devices for measuring steps, energy expenditure, and heart rate: systematic review." JMIR mHealth and uHealth, 8(9), e18694.