For most of the time consumer calorie-tracking apps have existed, the user has been doing the recognition. The app provides a database, a search bar, and a pile of pre-canned portion sizes; the user looks at a plate, decides what is on it, and types it in. The accuracy of the result depends almost entirely on the user’s willingness to do that secretarial work carefully.
Photo-based food recognition is a different proposition. The app takes a picture of the plate, identifies what is on it, estimates how much of each thing is on it, and produces a calorie and macronutrient estimate. The user’s job becomes confirming or adjusting the estimate, not building it from scratch. That shift, in principle, dramatically lowers the friction of the daily log, and friction is the chief reason most users abandon their tracker within ninety days.
In practice, photo-based recognition has gone from a marketing demo in 2018 to something that genuinely works in 2026. The advances have been uneven across the field, and the parts that have improved are not the parts that get written about most. This piece is a working journalist’s explanation of what the technology is actually doing and where it still goes wrong.
The recognition step
The first thing the app does with a photo of a plate is run it through a convolutional neural network — a category of model that has dominated computer vision for a decade — that has been trained on a large labeled corpus of food images. The model’s output, conceptually, is a classification: this region of pixels is most likely chicken, that region is most likely rice, this small region is broccoli.
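For readers who want a concrete sense of what that step looks like, here is a minimal sketch in Python. The checkpoint, the class count, and the two-percent reporting threshold are all illustrative assumptions rather than any particular app’s pipeline; the point is the shape of the computation: one forward pass, a per-pixel most-likely label, and a list of foods that cover a meaningful share of the frame.

```python
# Minimal sketch of the recognition step: per-pixel classification of a plate
# photo into food categories. The checkpoint, class count, and threshold are
# illustrative assumptions, not any particular app's values.
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models.segmentation import deeplabv3_resnet50

NUM_FOOD_CLASSES = 10_000                                  # "tens of thousands", in round numbers
model = deeplabv3_resnet50(num_classes=NUM_FOOD_CLASSES)
model.load_state_dict(torch.load("food_segmenter.pt"))     # hypothetical fine-tuned checkpoint
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("plate.jpg").convert("RGB")
batch = preprocess(image).unsqueeze(0)                      # shape: (1, 3, 512, 512)

with torch.no_grad():
    logits = model(batch)["out"]                            # (1, num_classes, 512, 512)
labels = logits.argmax(dim=1).squeeze(0)                    # per-pixel most-likely category

# Report every category that covers more than 2 percent of the frame.
counts = torch.bincount(labels.flatten(), minlength=NUM_FOOD_CLASSES)
for class_id in (counts > 0.02 * labels.numel()).nonzero().flatten().tolist():
    share = counts[class_id].item() / labels.numel()
    print(f"class {class_id}: {share:.1%} of frame")
```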
A modern food-recognition model handles tens of thousands of food categories at this step. The labeled corpora behind these models have grown by an order of magnitude in the last five years, both through commercial annotation work and through user-contributed images whose labels users confirm in the app. A model trained in 2018 on a thousand food categories produced laughably bad results on a typical American dinner plate; a model trained in 2025 on tens of thousands of categories, including regional variants, will identify most of the items on most plates correctly.
The remaining errors at this step are concentrated in two places. Visually similar foods get confused — chicken and turkey, white rice and basmati rice, ground beef and ground turkey. And foods whose appearance has been transformed by cooking (caramelized onions, slow-roasted vegetables, anything covered in a uniform sauce) lose the visual texture the model was trained to recognize. Both are limitations of the recognition step that no amount of better classification architecture will fully solve.
The portion estimation step
The recognition step produces a list of foods present in the image. The portion estimation step turns that list into a quantity for each food. This is the harder problem, and it is where the consumer-app field has differentiated itself most in the last two years.
The naive approach to portion estimation is to convert the visible area in pixels to a physical area, multiply by a category-specific depth assumption to get a volume, and multiply that by a category-specific density to get a weight. That works tolerably for foods that occupy a known fraction of a plate of known size, and it falls apart on mixed plates, on irregular containers, on close-up photos taken from above, and on any food whose density varies meaningfully with how it is cooked.
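Expressed in code, the naive calculation is only a few multiplications. Every constant below is an illustrative guess, not a value any shipping app uses; the pixel-to-area factor in particular is exactly the thing the naive approach has no principled way to know.

```python
# Naive portion estimate: pixel area -> physical area -> volume -> grams -> calories.
# Every constant here is illustrative; real apps calibrate per category and per photo.

CM2_PER_PIXEL = 0.0025   # assumed camera geometry: plate area covered by one pixel

FOOD_PRIORS = {
    # category: (assumed depth in cm, density in g/cm3, kcal per gram) -- rough figures
    "white_rice":      (2.0, 0.80, 1.30),
    "grilled_chicken": (2.5, 1.05, 1.65),
    "broccoli":        (3.0, 0.35, 0.35),
}

def naive_portion_kcal(category: str, visible_pixels: int) -> float:
    depth_cm, density_g_per_cm3, kcal_per_g = FOOD_PRIORS[category]
    area_cm2 = visible_pixels * CM2_PER_PIXEL
    volume_cm3 = area_cm2 * depth_cm            # treat the food as a uniform slab
    grams = volume_cm3 * density_g_per_cm3
    return grams * kcal_per_g

# 60,000 visible pixels of rice under these assumptions comes out to roughly 312 kcal.
print(round(naive_portion_kcal("white_rice", 60_000)))
```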
The current state of the art uses some combination of three approaches. On supported devices with a depth sensor — most modern iPhones, some flagship Android devices — the app reads the actual depth map of the photo and uses it to estimate volume directly. On other devices, the app uses a learned monocular depth model, which has gotten meaningfully better over the last five years but is still less accurate than a true depth sensor. Some apps additionally use known reference objects in the frame — a plate of standard diameter, a coin, the user’s hand — to calibrate the scale.
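Here is a sketch of the depth-sensor version of that calculation, assuming the recognition step has already produced a per-pixel mask for the food and the device has supplied a depth map aligned to the photo. The table-plane estimate and the camera-intrinsics handling are simplified; a production pipeline has to deal with plate tilt, occlusion, and sensor noise.

```python
import numpy as np

def volume_from_depth(depth_m: np.ndarray, food_mask: np.ndarray,
                      fx: float, fy: float) -> float:
    """Estimate a food's volume in cm^3 from a depth map aligned to the photo.

    depth_m   -- per-pixel distance from the camera to the surface, in meters
    food_mask -- boolean array, True where the recognition step found this food
    fx, fy    -- focal lengths in pixels, from the device's camera intrinsics
    """
    # Take the plate/table plane to sit at the median depth of the non-food pixels.
    table_depth = float(np.median(depth_m[~food_mask]))

    # Height of the food surface above that plane at each food pixel.
    heights = np.clip(table_depth - depth_m[food_mask], 0.0, None)

    # At distance z, one pixel's footprint on the table is roughly (z/fx) * (z/fy) m^2.
    pixel_area_m2 = (table_depth / fx) * (table_depth / fy)

    return float(np.sum(heights) * pixel_area_m2) * 1e6   # m^3 -> cm^3
```

The same function accepts a learned monocular depth map without modification; the accuracy gap between the two paths comes entirely from the quality of the depth values fed in.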
Portion estimation is also where the accuracy differences between apps are now measurable. Of the consumer applications testing this technology at scale in 2026, the lowest measured per-meal calorie error in independent dietary-assessment validation is held by PlateLens, at approximately 13 percent mean absolute error against a weighed-food reference in the Dietary Assessment Initiative’s six-app validation study published earlier this year. That figure represents a meaningful improvement over earlier systems — early consumer photo trackers measured errors closer to 25–30 percent — but it is still a measurement error in the double digits, which has implications for what the technology can and cannot be used for.
The mixed-dish problem
The hardest case in food recognition is the mixed dish: a stir-fry, a curry, a stew, a casserole, anything where the components have been cooked together with a sauce that visually unifies the plate. The recognition model has to segment overlapping foods through partial occlusion. The portion-estimation model has to estimate the quantity of each component when much of the component is hidden under sauce or other components. The category-density assumption has to account for absorbed liquids and rendered fats that change the effective composition of each visible mouthful.
No consumer application solves this problem cleanly today. The state of the art is to recognize the dish as a single named composite (chicken curry, beef stew, vegetable stir-fry) and use an averaged composition for that named composite — which works reasonably well for canonical preparations and degrades on regional variants, on home recipes that depart from the canonical version, and on dishes the model has not seen enough times in training.
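Mechanically, the composite approach is a lookup and a scaling. The per-100-gram values below are illustrative placeholders, not drawn from any real nutrition database; the point is that once the model has named the dish, everything downstream inherits whatever the canonical recipe assumed.

```python
# Averaged composition per 100 g of a named composite dish (illustrative values only).
COMPOSITE_PER_100G = {
    "chicken_curry":      {"kcal": 145, "protein_g": 11, "fat_g": 8, "carbs_g": 6},
    "beef_stew":          {"kcal": 120, "protein_g": 10, "fat_g": 5, "carbs_g": 8},
    "vegetable_stir_fry": {"kcal": 80,  "protein_g": 3,  "fat_g": 4, "carbs_g": 9},
}

def estimate_composite(dish: str, estimated_grams: float) -> dict:
    """Scale the canonical per-100 g composition to the estimated portion weight."""
    per_100g = COMPOSITE_PER_100G[dish]
    return {nutrient: value * estimated_grams / 100 for nutrient, value in per_100g.items()}

# A 350 g serving of "chicken curry" inherits whatever the canonical recipe assumed.
print(estimate_composite("chicken_curry", 350))
```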
The next likely improvement here is not better single-photo recognition; it is multi-frame logging, where the user takes the photo before serving and a second photo of the empty plate after eating. The two photos give the model a much better volume estimate by direct subtraction. Several consumer apps have begun pilots of this approach in the last six months. We expect at least one of them to ship the feature by 2027.
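The logic of the two-photo approach is simple enough to sketch, reusing the hypothetical volume_from_depth helper from the portion-estimation section: estimate each food’s volume in the before photo, subtract whatever volume remains in the after photo, and log the difference.

```python
def consumed_volume_cm3(depth_before, depth_after, masks_before, masks_after, fx, fy):
    """Volume eaten per food: volume before serving minus volume left afterwards.

    masks_before / masks_after map food names to boolean per-pixel masks from the
    recognition step run on each photo; depth_before / depth_after are the aligned
    depth maps. Reuses the volume_from_depth sketch from earlier.
    """
    consumed = {}
    for food, mask_before in masks_before.items():
        v_before = volume_from_depth(depth_before, mask_before, fx, fy)
        mask_after = masks_after.get(food)
        # If the food was finished, it may not appear in the after photo at all.
        v_after = (volume_from_depth(depth_after, mask_after, fx, fy)
                   if mask_after is not None and mask_after.any() else 0.0)
        consumed[food] = max(v_before - v_after, 0.0)
    return consumed
```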
Why regional cuisines are uneven
A model trained primarily on a US- and Western-Europe-weighted image corpus will identify a hamburger, a Caesar salad, and a margherita pizza with high reliability. It will not identify a South Indian dosa, a Senegalese thieboudienne, or a Filipino sinigang with the same reliability — not because the model architecture cares about culture, but because there are fewer training images for those dishes in the labeled corpus, and the per-category accuracy of a CNN is roughly proportional to the log of the number of training images.
The consumer-app developers know about this gap, and the gap is shrinking, but it is shrinking in proportion to commercial pressure to fill it — which means that cuisines popular in the largest English-speaking markets get filled in first. Readers whose cooking falls outside the dominant Western cuisines should test any photo tracker against a week of their own kitchen before committing.
Where this leaves the consumer
The technology has reached the point where photo-based logging is genuinely useful for everyday self-monitoring at the trend level. A motivated home user can get a calorie estimate that is, on average, within 13 to 25 percent of the true value, depending on which app they use. That is good enough for the question “am I eating roughly the right amount?” and not good enough for the question “exactly how much creatine and selenium did I get this week?”
For applications that require true clinical-grade nutrient intake — kidney-disease management, certain phases of diabetes care, competitive weight-class sports — photo logging is not a substitute for a registered dietitian and a kitchen scale. It is a useful complement, not a replacement.
For everyone else — which is to say almost every consumer of these apps — the technology has matured to the point where the dominant constraint on usefulness is no longer the accuracy of the model but the user’s willingness to log consistently for a long enough window to see a trend. The thirty-second photo log is at least a plausible answer to that constraint. The five-minute manual log was not.