Large Language Model (LLM)
LLM · foundation model · generative model
A neural network trained on vast text corpora to predict the next token, used as the engine behind every modern AI assistant.
A large language model is a transformer-based neural network trained on internet-scale text to predict the next token in a sequence. GPT-4, Claude, Gemini, Llama, and Phi are all LLMs. The model itself does not have memory, opinions, or real-time knowledge — those properties are added by the application built around it.
For SMB buyers, the practical implication is that the model is rarely the differentiator. Two AI products built on the same model can produce very different outcomes depending on the prompt design, grounding sources, governance posture, and adoption mechanics.
Retrieval-Augmented Generation (RAG)
RAG · grounded generation · context augmentation
The pattern of fetching relevant documents at query time and feeding them to an LLM so the answer is grounded in real, current data.
RAG is the dominant pattern for building business AI applications. At query time, the application searches a content store (SharePoint, a vector database, a SQL table) for relevant snippets, then includes those snippets in the prompt sent to the LLM. The model uses them as context when generating the answer.
Most enterprise AI assistants — including Microsoft 365 Copilot, most Copilot Studio agents, and most Azure AI Foundry agents — are RAG systems under the hood. The quality of the retrieval (right documents, right chunks, right ranking) usually determines the quality of the output more than the choice of model.
Grounding
grounded answers · tenant grounding
The act of anchoring an AI response in specific, retrievable source documents so the answer can be verified.
Grounding is what separates a useful business AI assistant from a generic chatbot. A grounded answer cites specific source documents — SharePoint pages, contracts, knowledge-base articles — so the user can verify the claim. Microsoft 365 Copilot grounds in the tenant; Copilot Studio agents ground in whatever connectors you wire up.
Ungrounded generation is appropriate for some tasks (drafting, brainstorming, summarising provided text), but anything that touches a customer, a number, or a policy decision should be grounded. The single biggest source of Copilot-related risk is users treating ungrounded answers as if they were grounded.
Hallucination
fabrication · confabulation
When an AI model generates a plausible-sounding answer that is factually wrong or unsupported by its sources.
Hallucination is the failure mode where an LLM produces output that looks confident and reasonable but is wrong. The model is not lying — it has no concept of truth — it is producing the statistically likely next tokens. When the training data thins out or the prompt is ambiguous, those tokens cease to correspond to reality.
Hallucination cannot be eliminated, only mitigated. Grounding, citation requirements, retrieval quality, verification UI ("show me the source"), and user training all reduce the rate. For SMBs the highest-leverage mitigation is usually training: teach users to verify before they send.
Prompt engineering
prompting · prompt design
The practice of designing the instructions sent to an LLM to consistently get useful output.
Prompt engineering is the discipline of writing the instructions — system messages, user messages, examples, constraints — that elicit good behaviour from an LLM. For end users it shows up as "the way I ask Copilot for a draft." For builders it shows up as the multi-section system prompt that defines an agent’s persona, scope, refusal behaviour, and output format.
At SMB scale the highest-ROI prompt-engineering work is usually building a shared prompt library: the 30–50 prompts that match how your teams actually work, pinned somewhere visible. It is more durable than chasing model upgrades.
Context window
token limit · context length
The maximum number of tokens (roughly words) an LLM can consider in a single request.
Every LLM has a context window: the upper bound on how much text it can read in one go. Modern models range from around 8,000 tokens to several million. Tokens are roughly three quarters of a word in English.
For business users, the context window is mostly invisible — Copilot and other assistants handle truncation under the hood. For builders it matters: it sets the limit on how many retrieved chunks a RAG pipeline can pass to the model, and it influences cost (more tokens in = more dollars per query).
Token
The unit of text an LLM processes — roughly three quarters of an English word.
Tokens are the chunks an LLM operates on: sub-word pieces produced by a tokenizer. "Hello" is one token; "tokenization" is three. As a rough rule of thumb, 1,000 tokens is around 750 English words.
Tokens matter commercially because most pay-as-you-go LLM pricing is per million tokens of input and output. They matter operationally because they define the context window and the upper bound on a single request.
Embedding
vector embedding · text embedding
A numerical representation of text that captures meaning, used to power semantic search and retrieval.
An embedding is a vector — typically several hundred to several thousand floating-point numbers — produced by an embedding model from a piece of text. Texts with similar meanings end up near each other in vector space, which is what makes semantic search work.
Embeddings are the backbone of RAG systems: when a user asks a question, the question is embedded, the system finds the nearest document chunks in vector space, and those chunks are fed to the LLM as context.
Fine-tuning
model fine-tuning · supervised fine-tuning · SFT
The process of further-training a base model on a smaller, task-specific dataset to specialise its behaviour.
Fine-tuning takes a pre-trained base model and continues training it on a smaller, curated dataset. The result is a model that is biased toward the patterns in the fine-tuning data — a specific tone of voice, a specific output format, a specific domain.
For SMBs, fine-tuning is almost always the wrong first answer. RAG, better prompting, and better retrieval will deliver more value at a fraction of the cost and ongoing maintenance burden. Fine-tuning is appropriate when you have stable, repetitive, high-volume use cases where prompt engineering has plateaued.