What are AI tokens?

AI foundations, models and capabilities

AI tokens are the chunks of text, and sometimes other media representations, that an AI model processes rather than whole words or full documents. A token might be a short word, part of a longer word, punctuation, or leading whitespace. Tokens matter because they govern how much a model can read at once, how much an API call costs, how long responses can be, and how easily a system runs out of context.

What this means

When people first use generative AI, they often think in words, pages or files. Models do not. They work in tokens. A token is a smaller unit produced by a tokenizer, the component that breaks text into pieces the model can turn into numbers.

Those pieces do not line up neatly with human language. Sometimes one word is one token. Sometimes a word is split into several tokens. Spaces and punctuation count. Common fragments may become single tokens because they appear often in training data. Less common or longer strings may be broken into smaller parts. Different models also use different tokenizers, so the same sentence can produce different token counts on different systems.

A simple mental model is this: if text is what you write, tokens are what the model can actually see. They are the meter for memory, cost and speed. If you understand tokens, you understand why a perfectly normal looking prompt might be expensive, why a long chat can suddenly lose earlier context, and why moving between providers can change both behaviour and cost.

This is not only about text. In modern multimodal systems, images, audio, video and documents are also converted into internal representations that are measured in token budgets or equivalent usage units. The exact method varies by provider, but the business effect is the same. Tokens are how usage is counted, limited and billed.

Why it matters

Leaders should care about tokens because they sit underneath three things that matter commercially. Cost, latency and context.

Cost is the most obvious. Most APIs price input and output by token. If a workflow sends large prompts, large retrieved passages, or very long responses, the bill rises. That sounds technical, but it quickly becomes an operating issue, especially when teams move from experimentation to scaled internal use.

Latency matters too. More tokens usually mean more processing and longer wait times. That can turn a useful assistant into a frustrating one if workflows are not designed with prompt size and response length in mind.

Context is the third issue, and often the most misunderstood. A model can only work within a finite context window, which is its short term working memory for a request or a conversation. If too much content is added, older information must be dropped, compressed or replaced. Token awareness is therefore central to prompt design, retrieval design, summarisation strategy, and guardrails around long running chats or agents.

How it works

The mechanics start with tokenisation. Before a model processes your prompt, a tokenizer converts the text into token IDs, numbers that correspond to chunks of text. A short common word may map to one token. A longer or rarer word may break into several subword pieces. This subword approach is useful because it keeps vocabularies manageable while still letting the model represent unfamiliar text.

Different tokenizers use different methods. Byte Pair Encoding, often shortened to BPE, has been widely used in GPT style systems. Other systems use approaches such as SentencePiece or unigram language models. The underlying idea is similar. Instead of storing every possible word, the model works with pieces that can be recombined efficiently across languages and domains.

Once text is tokenised, the model processes the sequence within a context window. This is the maximum amount of input and output the model can handle in a turn, subject to provider specific rules. If your prompt, attached context, tool instructions and expected answer together exceed the effective limit, something has to give. The system may reject the request, truncate the prompt, shorten the answer, or silently summarise older context depending on the product design.

Token usage is then tracked in categories. Input tokens are what you send. Output tokens are what the model generates. Cached tokens are previously seen prompt segments that a provider can reuse more cheaply in some architectures. Some advanced models also use reasoning tokens, extra internal work performed before the final answer. These categories matter because they affect both billing and system design.

That is why a short looking request can be larger than it seems. A message may include the user's text, a system prompt, hidden instructions, tool schemas, retrieval snippets, conversation history and files. All of that can count toward the live budget. For developers, token counting is therefore not an academic exercise. It is a way to predict whether a design will be affordable, responsive and robust.

Language matters too. Token counts do not map cleanly to characters or words across scripts. English rule of thumb estimates can be directionally helpful, but they are only rough guides. What looks concise to a human may tokenise quite differently in another language or another model family.

The rise of long context models has changed the conversation but not removed the constraint. A context window of hundreds of thousands or even a million tokens is powerful, but it does not mean "send everything". Long context costs money, slows requests, and can make relevance management harder. A model that can technically read an entire policy library still benefits from good retrieval, chunking and prompt structure.

Token mechanics also affect architecture decisions. In retrieval augmented generation, document chunks must be sized so that useful evidence fits inside the budget without crowding out instructions or answer space. In chat systems, teams often summarise older turns to preserve context. In coding assistants, large codebases may need file selection, ranking and compression. In agent systems, tool definitions and previous traces can quietly consume a surprising share of the window.

Providers now offer tools to count tokens before sending a request. That is an important operational control. It lets teams decide whether to route a task to a larger model, shorten inputs, cap outputs, or use caching. It also stops token spend from becoming invisible until the invoice arrives.

A final practical point is that tokens are not the same as value. Reducing token count is useful where it removes waste, such as repeated boilerplate or irrelevant retrieval snippets. But aggressive shortening can also remove needed context and degrade answer quality. The right goal is not the smallest prompt. It is the smallest prompt that still gives the model what it needs to do the job well.

Examples

A customer service team builds an assistant that includes the entire chat history and several long policy documents in every request. The assistant seems helpful in a trial, but token usage and response times rise sharply once more teams start using it. The issue is not the model alone. It is the token design.

An internal knowledge chatbot retrieves ten large document chunks for every question because "more context should help". In reality, the answer quality worsens because relevant details are buried, the context window fills up, and the system spends money reading unnecessary text.

A multilingual organisation notices that prompts for the same task cost differently across business units. The workflow was designed around rough English token estimates, but different scripts and provider tokenizers change usage patterns.

A coding assistant is pointed at a very large repository. Instead of sending whole directories to the model, the team adds file ranking, chunking and caching. The model becomes faster, cheaper and more reliable because the token budget is being managed deliberately.

Common misunderstandings

A common misunderstanding is that one token equals one word. It often does not. A token may be a word, part of a word, punctuation, or whitespace attached to a word.

Another is that a bigger context window removes the need for careful design. It does not. Long context is powerful, but irrelevant or poorly structured input can still reduce quality while increasing cost and latency.

People also assume token prices are directly comparable across providers. They are only partly comparable because different tokenizers, hidden system prompts, caching rules, reasoning behaviour and multimodal accounting can change what a real workflow costs.

Finally, many teams optimise only the visible prompt. They forget the hidden load from conversation history, tool definitions, retrieval snippets and long outputs.

Risks and boundaries

The main risk with tokens is invisibility. Teams often notice them only when a model truncates, forgets context, or produces a surprising bill. By then, the design issue is already built into the workflow.

There are also portability limits. Switching models can change token counts and effective context behaviour even when the task looks identical. A prompt that fits comfortably in one system may become expensive or unstable in another.

Reasoning models add another layer. They may use extra internal reasoning tokens, which can increase spend and reduce available room for final output if the request is not designed carefully. Multimodal inputs can do the same. This is why token awareness should sit with engineering, product and operations, not only with developer tooling.

What to do next

Ask every production AI workflow four plain questions. How many tokens go in? How many usually come out? What else is being attached behind the scenes? What happens when the budget is exceeded?

Then introduce simple discipline. Count tokens before live requests where possible. Cap output length where quality allows. Remove repeated boilerplate. Cache stable context. Keep retrieval relevant rather than simply large. For chats and agents, decide how conversation history will be trimmed or summarised.

Next, link tokens to spend. Log token usage by workflow, team and model. That makes it easier to see whether costs are driven by a genuinely valuable use case or by avoidable prompt bloat.

Finally, educate non technical owners. Prompt clarity, chunk size, file selection and answer length are not minor implementation details. They are the controls that decide whether an AI workflow remains practical once real usage begins.

FAQs

How many words are in one token?

There is no exact fixed rule. In English, rough estimates are often around three quarters of a word per token, but the real count depends on the model, the tokenizer, the language and the exact text.

Do spaces and punctuation count as tokens?

Yes, often they do. Tokenizers commonly attach a leading space to the next word or break punctuation into separate tokens.

What is a context window?

It is the amount of information a model can work with in a single request or conversation turn. Tokens fill that window, so larger prompts or outputs consume more of the available memory.

What are cached tokens?

They are prompt tokens that a provider can reuse from earlier requests instead of processing from scratch every time. Some providers bill cached tokens at a reduced rate.

What are reasoning tokens?

In some advanced models, they are extra internal tokens used while the model thinks through a problem before producing the visible answer. They can affect both cost and available budget.

Can I compare token prices across vendors directly?

Only carefully. Token counts vary by tokenizer, provider behaviour and workflow design, so the cheapest listed price per million tokens is not always the cheapest real workflow.

Sources