Daniel sent us this one — he's been working with Hugging Face professionally and he's puzzled by two terms that seem counterintuitive once you stop and think about them. First, what actually unifies everything under the umbrella of artificial intelligence when the range is so absurdly wide, from background depth prediction to object recognition to conversational models? But more specifically, he wants us to dig into why machine learning uses the word "task" to classify what a model does, and why we call model outputs "predictions" even when the model is generating an image or synthesising speech. The task classification question is the one he really wants unpacked.
By the way, today's script is courtesy of DeepSeek V four Pro.
There it is. All right, let's get into this because I think the word "task" is doing an enormous amount of quiet heavy lifting that most people never notice.
It really is. And I love this question because it exposes something that sits right at the boundary between engineering pragmatism and philosophy. Let me start with the unification question first, because the answer to that sets up everything else about tasks and predictions. What unifies all these wildly different things under AI is a single mathematical framing — we are always, in every case, learning a function that maps inputs to outputs. That is it. Whether the input is a paragraph of text and the output is a summary, or the input is a noisy image and the output is a denoised image, or the input is an audio clip and the output is a talking head video, you are approximating some function f of x equals y using data rather than explicit rules.
The unity is the optimisation framework, not the behaviour.
I mean, that is the core of it. The behaviour looks magical and diverse, but underneath you have a loss function, a dataset of input-output pairs, and a parameterised model that gets nudged by gradient descent until it approximates the target function. That has been the unifying paradigm since at least the nineteen eighties, and the deep learning revolution didn't change the paradigm, it just scaled it. What changed is that we got good enough at function approximation that the functions started looking intelligent.
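To make that concrete, here is the entire paradigm stripped to the bone: a minimal sketch in pure Python, with a toy linear model and invented numbers rather than any real system.

```python
# A toy version of the unifying paradigm: learn f(x) = y from data.
# Model: y_hat = w * x (a single parameter). Loss: mean squared error.
# The dataset below is invented for illustration.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]  # (input, output) pairs

w = 0.0    # the parameterised model starts out knowing nothing
lr = 0.01  # learning rate: how hard each nudge is

for step in range(1000):
    # Gradient of the MSE loss with respect to w, averaged over the dataset.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad  # nudge the parameter against the gradient

print(f"learned w = {w:.3f}")  # ~2.0: the target function has been approximated
```

Swap the linear model for a transformer, the four pairs for billions, and the hand-derived gradient for autodiff, and the shape of the loop does not change.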
Which is why the field can contain both a model that segments tumours in medical scans and a model that writes sonnets in the style of Shakespeare. They are both just function approximators trained on different data with different loss functions. That feels almost trivial to state but it's actually profound when you realise how much of the public conversation treats these as completely different species of thing.
And Hugging Face has been one of the most vocal organisations pushing back on the idea that AI equals chatbots. I saw a post from their team where they pointed out there are over two hundred thousand models on the Hub spanning something like forty distinct task categories. Depth estimation, image inpainting, token classification, text-to-audio, audio-to-audio, video classification, object detection, image segmentation, protein folding prediction — none of these are conversational. Yet when most people hear artificial intelligence, they picture a chat interface. That gap between public perception and the actual landscape of models is enormous.
That brings us to the task concept. Daniel's question is sharp because once you notice the word, it does feel odd. Why not "capability" or "function" or "application"?
The term comes directly from the machine learning research tradition, and it has a very specific technical meaning that predates Hugging Face by decades. In machine learning, a task is formally defined as the combination of a dataset and a performance metric. You see this in meta-learning literature going back to the nineties — a task is a specific distribution over input-output pairs plus a loss function that tells you how well you're doing. So when researchers talk about few-shot learning, they talk about "training on many tasks and evaluating on novel tasks." The word was baked into the academic vocabulary long before anyone built a model hub.
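It is worth seeing how little is actually in that formal definition. Here is a hypothetical sketch, with names and examples invented for illustration rather than taken from any library:

```python
from dataclasses import dataclass
from typing import Any, Callable

# The meta-learning framing: a task is a distribution over input-output
# pairs plus a function that scores how well you are doing.
@dataclass
class Task:
    examples: list                        # (input, output) pairs
    metric: Callable[[Any, Any], float]   # scores a prediction against truth

def accuracy(prediction: Any, truth: Any) -> float:
    return 1.0 if prediction == truth else 0.0

spam_detection = Task(
    examples=[("free money now!!!", "spam"), ("lunch at noon?", "not spam")],
    metric=accuracy,
)
```

Everything else, from architecture to training procedure, is a strategy for doing well on objects of this shape.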
Hugging Face inherited that terminology and adapted it for their catalogue.
Yes, and I think they did something quite clever with it. When they introduced the pipelines abstraction in twenty nineteen, they needed a way to organise models that abstracted away the underlying architecture. You don't want users to have to know whether a model is BERT or RoBERTa or T5 to use it for sentiment analysis. You want them to say "I need to do sentiment analysis" and get the right model. So they built the pipeline system around tasks — text classification, token classification, question answering, summarisation, and so on. The task became the user-facing organising principle, and the specific model became an implementation detail.
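That abstraction survives today as the pipeline function in the transformers library. The call below is the real API, though the default model it downloads varies with your installed version:

```python
from transformers import pipeline

# You declare the task; the library picks a sensible default model
# and handles tokenisation, inference, and decoding behind the scenes.
classifier = pipeline("sentiment-analysis")

print(classifier("Organising a model hub by task was a genuinely good idea."))
# e.g. [{'label': 'POSITIVE', 'score': 0.999...}]  (exact output varies by model)
```

Nothing in that code names an architecture. The task string is the entire interface.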
Which is a genuine design insight. Most software libraries organise by implementation. You browse classes and methods. Hugging Face organises by intent. You say what you want to accomplish and the library figures out how.
That is exactly the shift. And it is worth noting that this is not how most AI platforms operate. If you go to one of the major cloud providers and try to find a model that takes an image plus an audio clip and generates a talking-head avatar video, you are going to have a rough time. Daniel mentioned this in his prompt — you cannot filter on the exact task in most model API gateways. You can filter by modality, you can filter by provider, you can filter by price tier, but the actual functional intent is often buried in documentation that you have to read model by model.
That seems like a genuine failure of API design across the industry. If I want a model that does depth estimation from a single image, I should be able to filter for that and compare latency and accuracy across providers. Instead I have to know that provider A calls it "monocular depth estimation" and provider B calls it "depth-from-image" and provider C just doesn't offer it.
This is where the Hugging Face task taxonomy becomes genuinely useful as infrastructure. They maintain a canonical list of tasks — I think it is up to around forty-five now — and each one has a defined input schema, output schema, and evaluation methodology. When you upload a model tagged with a specific task, the system knows what shape of data to expect. It knows how to run inference. It knows which metrics to compute. That standardisation is invisible to casual users but it is the thing that makes the whole ecosystem work.
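That taxonomy is queryable from code. The sketch below uses the real list_models function from huggingface_hub, though the argument names match recent versions of the library; older releases used a ModelFilter object instead.

```python
from huggingface_hub import list_models

# Every model tagged with a task is discoverable by that tag, which is
# exactly the filtering the major API gateways tend not to offer.
for model in list_models(task="depth-estimation", sort="downloads", limit=5):
    print(model.id)
```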
Let me push on something though. The task taxonomy also constrains what you can express. If your use case falls between two tasks or combines them in a novel way, you are suddenly outside the system. Is that a real problem or a theoretical one?
It is real and the Hugging Face team has acknowledged it. There is an ongoing tension between the simplicity of a fixed taxonomy and the combinatorial explosion of real-world use cases. They have addressed this partly through the concept of pipeline components — you can chain a depth estimation model with an image segmentation model and build something custom — but the moment you step outside the predefined task list, you lose a lot of the automatic infrastructure. No automatic evaluation, no automatic widget generation, no automatic documentation.
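Here is a minimal sketch of that kind of chaining, using two standard tasks. The image path and the foreground logic are hypothetical, and the combination itself has no standard evaluation:

```python
from transformers import pipeline

depth = pipeline("depth-estimation")
segment = pipeline("image-segmentation")

image_path = "street_scene.jpg"  # hypothetical input image

depth_map = depth(image_path)["depth"]  # PIL image of predicted depth
regions = segment(image_path)           # list of {label, mask, score} dicts

# From here you are on your own: e.g. keep only the segments whose mean
# depth suggests they sit in the foreground. No task page describes this,
# no widget renders it, no leaderboard ranks it.
```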
The task system is both the platform's greatest strength and its invisible ceiling.
That is a fair characterisation. And it mirrors a deeper tension in machine learning research itself. The field has organised around benchmark tasks for decades — ImageNet classification, SQuAD question answering, GLUE natural language understanding. These tasks drove progress by creating shared goals that different research groups could compete on. But they also distorted the research landscape. People optimised for the benchmark rather than the underlying capability. The task becomes the target and the target becomes hollowed out.
That is the Goodhart's law problem — when a measure becomes a target, it ceases to be a good measure.
And the Hugging Face task system inherits some of that tension. When you define a task like "text summarisation," you are implicitly defining what good summarisation looks like through your evaluation metrics — probably ROUGE scores, which measure n-gram overlap with reference summaries. But anyone who has actually used summarisation models knows that ROUGE correlates only loosely with what humans consider a good summary. The task definition smuggles in assumptions about what matters.
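You can see how thin that metric is by computing it. This uses the real evaluate library; the two example summaries are invented:

```python
import evaluate

rouge = evaluate.load("rouge")

# Two summaries a human might rate very differently can score similarly,
# because ROUGE only counts n-gram overlap with the reference text.
scores = rouge.compute(
    predictions=["the model predicts pixel values from text prompts"],
    references=["the model predicts the most likely pixels given a prompt"],
)
print(scores)  # {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}
```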
Let me shift us to the second term Daniel flagged — predictions. This one bothers me more, honestly. I understand why a classification model makes a prediction. But calling the output of an image generation model a prediction feels like a category error.
This is where the mathematical framing I mentioned earlier really earns its keep. In the function approximation view, every model output is a prediction in a precise technical sense. You have a conditional probability distribution over outputs given inputs. The model is predicting the most likely output — or sampling from that distribution — based on what it learned during training. When Stable Diffusion generates an image of a cat, it is predicting what pixel values are most probable given the text prompt and the noise schedule and everything it learned from billions of image-text pairs. The fact that the output is creative or aesthetic doesn't change the underlying mechanism.
I understand the mathematical rationale, but I still think the word is misleading in practice. When a human makes a prediction, they are forecasting a future state of the world that can later be verified. When a model generates an image, there is no ground truth to verify against. The cat does not exist. There is no future moment where we check whether the cat was real.
That is a fair critique, and it actually points to a distinction that the machine learning community has debated. There is a difference between what some researchers call "predictive" tasks and "generative" tasks. In a predictive task, there is a clear ground truth — did the model correctly classify this tumour as malignant? In a generative task, the evaluation is fuzzier — is this image aesthetically pleasing, does it match the prompt, is it coherent? But the training objective is still fundamentally predictive. The model is trained to predict the next token, or to predict the original image from a noised version. The prediction is always there in the loss function even if the final output doesn't feel like a forecast.
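Here is a toy illustration of why the word survives: under the hood, generation really is repeated prediction. The vocabulary and scores below are invented.

```python
import math
import random

vocab = ["cat", "dog", "sits", "runs"]
logits = [2.0, 1.5, 0.3, 0.1]  # what a trained model might score in context

def sample(logits, temperature=1.0):
    # Softmax turns raw scores into a conditional probability distribution;
    # sampling from it is what makes the output feel creative rather than
    # forecast-like, even though the distribution itself is a prediction.
    scaled = [l / temperature for l in logits]
    total = sum(math.exp(s) for s in scaled)
    probs = [math.exp(s) / total for s in scaled]
    return random.choices(vocab, weights=probs)[0]

print(sample(logits))  # a prediction in the technical sense; a generation in feel
```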
The word sticks around because it accurately describes the training process even when it inaccurately describes the inference behaviour.
And I think there is also a historical reason. In the early days of machine learning, almost all applications were predictive in the colloquial sense. Predict housing prices. Predict credit risk. Predict whether an email is spam. The field grew up around problems where the output was literally a forecast. Then as models got more powerful and we started applying them to synthesis and generation, the terminology didn't update. It is a classic case of conceptual inertia.
It also does something subtle to how people think about these systems. Calling it a prediction implies a kind of epistemic humility that the model may not actually possess. A prediction can be wrong. A generation, in the colloquial sense, is just a creation. It can be good or bad but it is not wrong in the same way.
That is a really interesting point. The word "prediction" keeps the model tethered to the idea of correctness, which is both scientifically honest and potentially misleading. Scientifically honest because the model really is estimating a conditional probability and those estimates can be poorly calibrated. Potentially misleading because it suggests there is a fact of the matter about what the best image of a cat is, which there is not.
Let me bring this back to Daniel's practical context. He is working with these tools professionally. When you are building a product that uses AI models, the task-prediction framing shapes your engineering decisions in concrete ways. You think in terms of inputs and outputs. You think about evaluation metrics. You think about failure modes that are specific to the task definition. That is useful mental scaffolding.
And I think the Hugging Face ecosystem has done more than almost any other platform to make that scaffolding explicit and accessible. The task page for each category — let me pull up an example — the "image-to-image" task page defines exactly what inputs the model expects, what outputs it produces, what the typical use cases are, and which models are available. It turns what could be a vague capability into something you can reason about systematically.
Which is the whole point of engineering abstractions. You hide the complexity you don't need and expose the knobs you do need. The task is the right abstraction level for most use cases. Lower than that and you are drowning in architecture details. Higher than that and you are making vague promises about "AI-powered features" that nobody can actually implement.
This connects to something I have been thinking about with the current state of AI tooling. We are in this strange moment where the raw capabilities of models are advancing faster than our ability to build good interfaces around them. The Hugging Face task system represents one philosophy — catalogue everything, standardise the interfaces, let users compose. The big API providers represent another philosophy — curate a handful of the most commercially valuable capabilities, wrap them in simple endpoints, and handle the rest through professional services.
Which approach wins?
I don't think either wins in a total sense. The curated approach wins for the eighty percent of use cases that are text generation, image generation, and speech-to-text. The catalogue approach wins for the long tail — medical imaging, scientific applications, niche industrial use cases. The question is whether the long tail is economically significant enough to sustain a platform.
Given that Hugging Face raised at a valuation of four and a half billion dollars, someone thinks the answer is yes.
Someone definitely thinks so. And I would point out that the long tail gets longer every year as models get more specialised. We are seeing models for specific protein interactions, models for specific types of satellite imagery analysis, models for specific manufacturing defect detection tasks. These are not general-purpose chatbots. They are precision tools for narrow problems. The task taxonomy is what makes them discoverable.
Let me circle back to something you said earlier about the task taxonomy being both infrastructure and constraint. I wonder if the next evolution is dynamic task discovery — where the system observes what you are trying to do and suggests or even constructs a task definition on the fly.
That is an active research area. There is work on automated task formulation where you describe what you want in natural language and the system retrieves or composes the right models. The challenge is that tasks have precise input-output schemas and evaluation criteria, and those are hard to infer from a natural language description alone. You end up needing a human in the loop to validate that the system understood correctly.
Which is basically the problem of requirements engineering, just applied to models instead of software.
And requirements engineering is famously hard. Most software projects fail because of poor requirements, not poor implementation. The same dynamic applies here. If you cannot precisely specify what you want the model to do — what the inputs are, what the outputs should look like, what counts as success — you are going to have a bad time regardless of how good the model is.
The task concept is doing double duty. It is a technical specification and a requirements document rolled into one.
And that is why I think Daniel's question is more than terminological curiosity. Understanding why the field uses these words is understanding how the field thinks. The task is the unit of work. The prediction is the unit of output. Everything else — the architectures, the training procedures, the evaluation benchmarks — is in service of performing tasks and producing predictions.
There is a philosophical layer here too that I want to touch on. When we call something a task, we are implicitly treating the AI system as an agent that performs work. That framing brings with it a whole set of assumptions about responsibility, reliability, and delegation. If a model performs a task and gets it wrong, we hold someone accountable — the developer, the deployer, maybe the user. But if a model just "produces outputs," the accountability is fuzzier.
That is an important point. The language we use shapes the legal and ethical frameworks we build around these systems. The European Union AI Act, for instance, is organised around risk categories that are defined in terms of the tasks the AI system performs. A system that performs biometric categorisation is treated differently from one that performs content recommendation. The task is the unit of regulatory analysis.
The terminological choice that started as a convenient academic shorthand in the nineties now has legal force in the twenty twenty-six regulatory landscape. That is quite an arc.
And it is worth noting that "prediction" is also doing regulatory work. When a model outputs a prediction about a person — their creditworthiness, their recidivism risk, their job performance — that prediction is subject to fairness and transparency requirements in many jurisdictions. The word carries legal weight.
Let me ask you something more speculative. Do you think the task-prediction framing will survive the next generation of AI systems? If we get models that are more agentic, that pursue goals over time, that interact with the world rather than just mapping inputs to outputs — does the vocabulary still hold?
I think it gets stretched but not broken. Even an agentic system can be decomposed into tasks and predictions. A language model agent that browses the web and fills out forms is performing a sequence of tasks — read this page, predict which button to click, predict what text to enter. The loop gets more complex but the atomic operations are still tasks and predictions. Whether that decomposition is the most useful way to think about agentic systems is a different question.
That is the reductionist answer. The emergent behaviour answer would be that at some level of complexity, the task decomposition becomes a misleading abstraction, the way describing a human conversation in terms of phoneme predictions misses everything interesting about communication.
I think that is right, and it is the frontier where machine learning meets cognitive science. We don't yet have a good vocabulary for the emergent layer. We have tasks and predictions for the mechanistic layer, and we have vague words like "understanding" and "reasoning" for the emergent layer, and nothing in between. Daniel's question about why we use these terms is partly a question about that gap.
The honest answer to "why do we call it a task" is partly "because it works as an engineering abstraction" and partly "because we haven't invented better language for the higher-level phenomena yet."
That is a good summary. And I would add one more thing — we call it a task because machine learning inherited its conceptual framework from statistics and optimisation theory, where problems are defined in terms of objective functions and constraints. The word "task" is a translation of "problem instance" from optimisation literature. It is not a term that was chosen for its philosophical aptness. It is a term that was available and precise enough to get the work done.
Which is how most technical terminology actually develops. Someone needs a word, they borrow the closest one available, and thirty years later everyone has forgotten it was ever a choice.
Now: Hilbert's daily fun fact.
The longest recorded flight of a chicken is thirteen seconds.
If you are working with these systems professionally, what do you actually do with this understanding? First, I think you should take the Hugging Face task taxonomy seriously as a design tool. When you are scoping a feature that involves AI, try to map it onto an existing task definition before you start building. If it maps cleanly, you will save yourself enormous amounts of time because the infrastructure already exists. If it does not map cleanly, that is useful information — it means you are doing something novel and you should budget for custom infrastructure.
Second, I think you should be explicit about what "prediction" means in your specific context. If you are building a system that generates content, be clear with your users and your stakeholders that the model is sampling from a learned distribution, not retrieving facts or applying rules. The word "prediction" obscures that distinction and you will have to un-obscure it eventually when someone asks why the model produced something unexpected.
Third, push back on API providers that do not support task-based filtering. The fact that you cannot search for models by what they actually do is a market failure, not a technical limitation. If enough developers demand it, the providers will build it.
Fourth, and this is more of a mindset thing, treat the task definition as a hypothesis rather than a given. The way Hugging Face defines "text classification" may not match what you actually need. The input-output schema may be slightly wrong for your use case. The evaluation metric may not capture what matters to your users. Be willing to adapt the task definition to your context.
Finally, keep an eye on the gap we talked about — between the mechanistic task-prediction layer and the emergent behaviour layer. As models get more capable, the interesting engineering challenges are going to be in that gap. How do you specify goals for a system that operates over extended time horizons? How do you evaluate outputs that are creative or strategic rather than predictive? These are open questions and the vocabulary is still being invented.
I think that is a good place to leave it. Daniel's question about why we use these words turns out to be a question about the conceptual foundations of the whole field. The task is how we carve AI capabilities into manageable units. The prediction is how we understand what models actually compute. Neither term is perfect, but both have earned their place through decades of practical use.
The fact that the Hugging Face ecosystem has built so much infrastructure around these two concepts is a testament to how powerful they are as organising principles. The next time you browse the model hub and filter by task, you are participating in a tradition that goes back to the earliest days of machine learning research. That is kind of wonderful.
This has been My Weird Prompts. Thanks to Hilbert Flumingtop for producing. If you enjoyed this episode, leave us a review wherever you listen — it helps other people find the show.
Talk to you next time.