Explainer

AI for Diplomats

This primer is written for diplomatic professionals to give a basic introduction to the main approaches to artificial intelligence (AI), the ways such systems work, and how they are being applied in diplomacy and security. 


Belfer Center Primer: AI for Diplomats 

 

This primer sets out to explain in practical terms: 

  • Different kinds of AI and how they work at a high level;
  • The main kinds of AI systems in use today, particularly in diplomacy and government;
  • Where AI is useful, where it is not, and what kinds of oversight it requires. 

 

Introduction 

Artificial intelligence is not new, but it is becoming increasingly central to our everyday lives. The term AI was first coined in the 1950s, and for many years, AI systems were mainly used behind the scenes for tasks like predicting trends or sorting information. Now, with improvements in computing power and data availability, tools like ChatGPT have made AI visible to a much wider audience, but these language-based querying tools only scratch the surface of AI capabilities. The field of AI covers a vast range of computational tools, with so-called large language models representing only part of it. 

Daily Interactions With AI 

You already interact with AI systems multiple times each day: 

  • When you unlock your phone with your face, an AI system has learned what you look like from many past images and is checking that it’s really you.
  • When you use a navigation application to find your way home, an AI model is using past and live traffic data to predict the fastest route, and your transit app is using learned patterns of delay to update arrival times.
  • Your emails, sorted into “spam” or “important,” have passed through classification systems trained on millions of emails to spot fraud, advertising, or genuine messages.
  • The order of posts in your social media feed has been chosen by AI-based recommendation systems that continuously infer your interests from your habits, whether your past social media usage, your browsing history, or your online ordering activities. 

 

None of these systems truly understands you or your day, but each performs a discrete task. Through complex algorithms, these systems spend the day recognizing, predicting, ranking, and generating information, all under the broad heading of AI. 

A Potential Future 

Based on existing capabilities, a senior diplomat’s day could look like this:  

  • Your daily brief is automatically generated and interactive. Based on thousands of cables and news stories, it can dynamically answer your questions and provide validated timelines, maps, or forecasts within seconds that you can then challenge and refine with staff.
  • In tabletop exercises, a simulator plays your counterpart, realistically adapting its strategy as you change yours, exposing blind spots in your assumptions and helping to refine your messaging.
  • During internal or external meetings, each discussion point is tagged and linked to existing policy. By the time you leave, your staff already have AI-drafted accurate and informed follow-up letters and action plans.
  • You are kept up to date by background systems that trawl public and private data for anomalies—sudden changes in shipping patterns, a new cluster of cyber probes, or a coordinated narrative shift online—and push a short, prioritized “incident strip” to your phone. 

 

None of these tools makes policy on its own, but together they change the breadth, depth, and speed of your work. In short, they augment you. With AI you can see more, sooner, and spend more time deciding what to do with relevant information rather than gathering information and sifting through it. 

 

What We Mean by AI 

In this primer, artificial intelligence means computer systems that can perform tasks that usually require human intelligence. Though this is a moving target, examples include: 

  • Understanding and generating language (in response to queries and prompts);
  • Recognizing patterns in data, images, or audio;
  • Making predictions or recommendations. 

 

These AI systems conduct two main types of tasks: 

  • Classification/prediction: using existing data to determine whether a new piece of data is similar to other entries. For example:
      • When your bank flags an unusual transaction (fraud detection); or
      • When your phone decides if the face it’s presented with is yours (FaceID).
  • Generation: creating new content, such as text, images, or audio, based on patterns learned from training data. Large language models like ChatGPT and video generation tools are in this second group. For example:
      • What is the most likely pattern of words that can respond to a given query?
      • What sort of image would look like a cat?  

 

Main AI Approaches 

Today’s AI systems conduct these tasks in three main ways: pattern matching, rule-following, and independent learning. They often overlap in practice, but the distinctions are useful for understanding their capabilities and risks. 

 

1. Pattern Matching – classical machine learning 

Classical machine learning uses statistical methods to find patterns in data. A typical setup looks like this: 

  • Humans decide what information to feed into the system (for example: “income,” “country,” “sector,” “prior incidents”).
  • The system looks for relationships between those inputs and some outcome (for example: “was there a violation?”).
  • The system learns rules that help it make predictions on new cases. 

 

Because humans choose the key input variables, these systems are often more transparent. It is easier to inspect what the model is using and to check whether it is behaving sensibly. But they are not fully automatic and still require expert design and oversight. 
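As a concrete illustration, the setup above can be sketched at toy scale: a human chooses the input variable (here, a count of prior incidents) and the system searches for the decision rule that best fits past outcomes. The data, feature, and threshold rule below are invented for illustration, not drawn from any real system.

```python
# Toy sketch of classical machine learning: humans pick the input
# feature ("prior incidents"), and the system searches for the
# decision rule (a threshold) that best separates past outcomes.
# All data here is invented for illustration.

def learn_threshold(examples):
    """Find the cutoff on a single feature that misclassifies
    the fewest labelled examples (label 1 = violation)."""
    candidates = sorted({x for x, _ in examples})
    best_cutoff, best_errors = None, len(examples) + 1
    for cutoff in candidates:
        errors = sum(
            (x >= cutoff) != bool(label)  # predict a violation when x >= cutoff
            for x, label in examples
        )
        if errors < best_errors:
            best_cutoff, best_errors = cutoff, errors
    return best_cutoff

# (prior_incidents, was_there_a_violation)
history = [(0, 0), (1, 0), (2, 0), (3, 1), (5, 1), (7, 1)]
cutoff = learn_threshold(history)
print(cutoff)       # learned rule: flag any case with incidents >= 3
print(4 >= cutoff)  # prediction for a new case with 4 prior incidents: True
```

Because the learned rule is a single readable threshold, a reviewer can inspect exactly what the model is doing, which is the transparency advantage described above.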

 

2. Rule Following – symbolic reasoning 

Symbolic systems are closer to traditional rulebooks or doctrine encoded into software. They work by applying explicit rules written by humans. For example: 

  • “If the document is expired, then reject it.”
  • “If a request falls under these criteria, then route it for review.” 

 

These systems are good when the rules are clear and stable. They are easier to audit, because you can read the rules directly. However, they struggle with ambiguity, messy real-world data, and rapidly changing environments. 
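The two example rules above can be written directly as code, which is exactly why symbolic systems are easy to audit: the rulebook is the program. The field names, categories, and dates below are invented placeholders.

```python
# Minimal sketch of a symbolic (rule-based) system: every rule is
# written by a human and can be read and audited directly.
# Field names, categories, and dates are invented for illustration.

from datetime import date

def route_request(request, today=date(2025, 1, 1)):
    if request["expiry"] < today:
        return "reject"            # "If the document is expired, then reject it."
    if request["category"] in {"visa", "asylum"}:
        return "route_for_review"  # "If a request falls under these criteria..."
    return "approve"

print(route_request({"expiry": date(2024, 6, 1), "category": "visa"}))  # reject
print(route_request({"expiry": date(2026, 6, 1), "category": "visa"}))  # route_for_review
print(route_request({"expiry": date(2026, 6, 1), "category": "trade"})) # approve
```

The weakness is equally visible: a request that does not fit the anticipated fields or categories simply falls through to a default, which is how such systems struggle with messy real-world data.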

 

3. Independent Learning – deep learning 

Deep learning is a more recent and powerful approach. Instead of relying on human-designed rules or hand-picked variables, deep learning systems function more independently. They: 

  • Ingest raw data such as text, images, audio, or sensor feeds.
  • Learn internal patterns by progressively adjusting millions or billions of internal connections.
  • Use these learned patterns to perform tasks like translation, image recognition, or summarization. 
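
The "progressively adjusting internal connections" step can be shown at an extremely reduced scale: one adjustable connection (weight) instead of billions, nudged repeatedly to shrink prediction error. The data pairs are invented, and real deep learning differs in every dimension of scale, but the adjustment loop has this shape.

```python
# Extremely scaled-down sketch of the "adjust internal connections"
# loop in deep learning: one connection (weight) instead of billions,
# nudged repeatedly to reduce prediction error on example data.

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]    # invented (input, target) pairs

weight = 0.0                                   # one "internal connection"
learning_rate = 0.02
for _ in range(500):                           # many small adjustments
    for x, target in data:
        prediction = weight * x
        error = prediction - target
        weight -= learning_rate * error * x    # gradient step on squared error
print(round(weight, 3))                        # converges to 2.0
```

No human told the system the answer was "multiply by two"; it arrived there purely by error-driven adjustment, which is also why it is hard to say *why* a large network with billions of such connections made any particular choice.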

 

Deep learning powers many of the AI breakthroughs of the last decade, including large language models. These systems often outperform classical methods on complex tasks. However, because they learn independently on large amounts of data, they come with significant challenges:   

  • They are often “black boxes,” which means it is hard to see exactly why the system made a particular choice.
  • They are data-dependent: If the training data is biased, incomplete, or unrepresentative, the system can amplify those problems.
  • They can fail in surprising ways when used in contexts that differ from what they were trained on.
  • They are very expensive to run, due to the large amount of computational power required to process, and learn from, all the data. 

 

How AI Learns: What Is Training? 

Most AI systems are trained using one or more of three broad learning approaches: supervised learning, unsupervised learning, and reinforcement learning.  

 

1. Supervised learning: The system learns from labelled examples

  • The system is given a large quantity of sample inputs (e.g., a past shipment, a document, an image), each paired with the correct answer, which normally comes from human labelling. Example “answers” might include “high risk/low risk,” “approved/rejected,” or “contains hate speech/does not contain hate speech.”
  • From these examples, the system learns measures that characterize the inputs associated with each answer.
  • New inputs are then measured and compared to the known, labelled samples to determine whether each is a close enough match to be given the same label. 

 

Uses include identifying threats or anomalies in operational data, labeling images (e.g., satellite images of damaged buildings), or classifying documents into categories (e.g., topic or sensitivity). 
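The matching step described above can be sketched as a nearest-neighbour check: a new input is compared to the labelled samples and takes the label of its closest matches. The feature ("anomaly score"), labels, and numbers below are invented for illustration.

```python
# Sketch of supervised matching: a new input is compared to labelled
# samples and takes the label of its k closest matches (a
# nearest-neighbour check). Features and labels are invented.

def classify(new_point, labelled, k=3):
    """Label a new input by majority vote among its k closest samples."""
    by_distance = sorted(labelled, key=lambda s: abs(s[0] - new_point))
    votes = [label for _, label in by_distance[:k]]
    return max(set(votes), key=votes.count)

# (anomaly_score, label) from past, human-labelled shipments
samples = [(0.1, "low risk"), (0.2, "low risk"), (0.3, "low risk"),
           (0.8, "high risk"), (0.9, "high risk"), (1.0, "high risk")]

print(classify(0.25, samples))   # low risk
print(classify(0.85, samples))   # high risk
```

The quality of the human labels sets a ceiling on the quality of the system: mislabelled samples produce mislabelled predictions.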

 

  2. Unsupervised learning: The data is not labelled, and the system interprets it on its own: 

  • It groups similar items together (“clustering”);
  • It looks for unusual or rare patterns; and
  • It identifies underlying structure in large datasets. 

 

Uses include organizing large text collections (e.g., by theme), finding anomalies, or segmenting populations based on behavioral patterns (e.g., travel, communication, or spending). 
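One of the unsupervised tasks named above, anomaly detection, can be sketched with no labels at all: the system simply flags values that sit far from the norm of the data it has seen. The shipping counts below are invented.

```python
# Sketch of unsupervised anomaly detection: no labels are provided;
# the system flags values far from the norm of the dataset.
# The daily shipping counts are invented for illustration.

from statistics import mean, stdev

def flag_anomalies(values, cutoff=2.0):
    """Return values more than `cutoff` standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) > cutoff * s]

daily_shipments = [101, 98, 103, 99, 102, 100, 97, 250]  # one sudden spike
print(flag_anomalies(daily_shipments))                   # [250]
```

Nothing told the system that 250 was "wrong"; it is flagged only because it is rare relative to the rest, which is both the strength and the limitation of unsupervised methods.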

 

3. Reinforcement learning: Reinforcement learning is akin to trial and error guided by feedback and requires defining success, typically through scoring outcomes the system should prioritize.  

  • The system takes actions in a simulated or real environment;
  • It receives feedback (a “reward”) on how good the outcome was and improves its strategy for the next round;
  • Over many rounds, it learns strategies that tend to lead to better outcomes, which are represented by greater “rewards.”   

 

Many applications train in simulation and then transfer the strategy to real settings with safeguards. This is used in robotics, simulating alternative courses of action, testing strategies in resource management, and supporting decision-making in complex, dynamic environments. 
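The trial-and-error loop above can be sketched with the simplest possible environment: two options with hidden success rates, where the system mostly exploits whichever option has paid off best so far but occasionally explores the other. The options and reward probabilities are invented, and the exploration rate is an arbitrary choice.

```python
# Sketch of reinforcement-learning-style trial and error: the system
# tries actions, receives a numeric "reward," and gradually prefers
# the action with the better average outcome. Rewards are simulated
# with invented success rates.

import random
random.seed(0)

success_rate = {"option_a": 0.3, "option_b": 0.7}   # hidden from the learner
totals = {a: 0.0 for a in success_rate}
counts = {a: 0 for a in success_rate}

def average(a):
    return totals[a] / counts[a] if counts[a] else 0.0

for step in range(2000):
    if random.random() < 0.1:                       # explore occasionally
        action = random.choice(list(success_rate))
    else:                                           # otherwise exploit best-so-far
        action = max(totals, key=average)
    reward = 1.0 if random.random() < success_rate[action] else 0.0
    totals[action] += reward                        # feedback on the outcome
    counts[action] += 1

best = max(totals, key=average)
print(best)   # the strategy converges on "option_b"
```

Note that success had to be defined in advance as a number (the reward); the system optimizes that number, not the intent behind it, which is why reward design matters so much in real deployments.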

 

Large Language Models 

Language models are a kind of deep learning system designed for text. At their most basic, they are prediction models trying to predict the most likely continuation of “tokens.” A token is a unit that is either a word, part of a word, punctuation, or whitespace, depending on frequency and language. Predicting the next token is, therefore, like a sentence autocomplete function you might have on an email or messaging platform.1  

 

OpenAI released ChatGPT in November 2022 to demonstrate its general purpose “text in, text out” language model. The chatbot was released as a research preview rather than as a finished product. Before that point, the underlying model had been used to extract information from documents, conduct basic drafting and editing, summarize documents, and perform basic language analysis.  

 

Language models are trained through three stages:  

  1. Pretraining (next-token prediction) 

The model reads billions of words and sentences to learn how different parts of words (tokens) fit together and relate to one another mathematically. From this, a sufficiently powerful prediction model can “guess” what the next word should be. They predict tokens one at a time. In modern systems, that next token prediction may be based on the relationships of hundreds of thousands of prior, related tokens.  
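Next-token prediction can be shown at toy scale by counting which word most often follows each word in a tiny "training" corpus, then repeatedly emitting the likeliest continuation. The corpus is invented; real models use subword tokens and billions of learned weights rather than raw counts, and condition on vastly longer contexts than a single preceding word.

```python
# Toy sketch of next-token prediction: tally which word most often
# follows each word in a tiny invented corpus, then autocomplete by
# repeatedly emitting the most likely continuation. Real models use
# subword tokens and learned weights, not raw counts.

from collections import Counter, defaultdict

corpus = ("the envoy signed a treaty . "
          "the envoy signed a treaty . "
          "the minister signed an accord .").split()

follows = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follows[current][nxt] += 1                       # "training": tally continuations

def autocomplete(word, length=4):
    out = [word]
    for _ in range(length):
        word = follows[word].most_common(1)[0][0]    # pick the likeliest next token
        out.append(word)
    return " ".join(out)

print(autocomplete("the"))   # "the envoy signed a treaty"
```

Because this toy model only looks one token back, it is easily fooled by longer-range structure; the "hundreds of thousands of prior, related tokens" in modern systems are precisely what separates them from this sketch.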

  2. Instruction tuning (supervised fine-tuning) 

The model is given many examples of high-quality answers to the kinds of questions it will have to answer. This might be drafting, summarizing, or information extraction, making it more capable and reliable.  

  3. Preference training (RLHF) 

Here humans judge the quality of different responses. Answers deemed “good” by some criteria are “rewarded” through reinforcement learning, so the model is adjusted to produce responses more aligned with those criteria. In effect, the model is trying to learn how a human allocates rewards so that it can accrue as much reward as possible. This process is known as reinforcement learning from human feedback (RLHF).  

 

A chat interface is a packaging of the same underlying mechanism: The model receives a block of text and generates a continuation. But a persistent “conversation” with a chatbot is an illusion: Each message is independently generated but is made to feel like a continuous conversation.  

 

Each time an answer is generated, the previous messages are included as an input alongside other relevant context and information. Each message is a separate prediction problem or “query” for the model, so the text you type is only one component of what the model sees.  

 

When you send a query to a model like ChatGPT, the system adds a lot of scaffolding around what you send. This scaffolding is known as the “context” for your message. Language models don’t “remember” things; prior information is simply included in the context alongside your message, or “prompt.” This context includes things like:  

 

  • System instructions: rules that define the chatbot’s role, safety constraints, truthfulness standards, and information about the model.
  • Developer instructions: tone, formatting requirements, and organization-specific policies (for example, how to handle classified or sensitive content).
  • Conversation record: prior user and assistant messages (or a summary of them).
  • Retrieved references (optional): relevant information pulled from documents, databases, or the web (often called retrieval-augmented generation, or RAG).
  • Tool descriptions and tool outputs (optional): descriptions of what tools exist (search, databases, calculators) and the results returned by those tools.
  • The user’s message that is typed and sent from the input box. 
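The assembly of these pieces can be sketched as simple string concatenation: the model ultimately receives one block of text, of which the user's typed message is only the final part. Every instruction string, document snippet, and message below is an invented placeholder.

```python
# Sketch of context assembly: the text sent to the model is built from
# several sources, and the user's typed message is only one part of it.
# All instruction strings and snippets are invented placeholders.

def build_prompt(user_message, history, retrieved=None):
    parts = [
        "SYSTEM: You are a careful assistant. Cite sources. Refuse unsafe requests.",
        "DEVELOPER: Use a formal tone. Follow department formatting policy.",
    ]
    parts += [f"HISTORY: {turn}" for turn in history]   # conversation record
    if retrieved:                                       # optional RAG snippets
        parts += [f"REFERENCE: {doc}" for doc in retrieved]
    parts.append(f"USER: {user_message}")
    return "\n".join(parts)

prompt = build_prompt(
    "Summarize the latest cable.",
    history=["USER: Hello.", "ASSISTANT: How can I help?"],
    retrieved=["Cable snippet: trade talks stalled over tariffs."],
)
print(prompt)
```

Because the conversation record is re-sent with every query, the "memory" of a chatbot is really just this growing block of text, which is why long conversations eventually hit context limits.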

 

This means the message sent to the actual language model has been processed, prepended, and appended. Guided by these instructions on how to behave, the model tries to “predict” the correct response. A bit like roleplaying, it generates the text that best matches its learned patterns given the instructions and context. 

 

Separately, many systems apply guardrails outside the model (before or after generation), such as: 

  • Filtering or blocking certain inputs;
  • Preventing certain categories of outputs;
  • Redacting sensitive strings; and
  • Enforcing logging, audit, or retention rules. 

 

This distinction is important operationally: The “model” is only one component of an end-to-end system. 
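Such guardrails sit outside the model and can be ordinary code: a check on the input before generation, and a cleanup pass on the output after it. The blocked-topic list and redaction pattern below are invented placeholders, far simpler than production safety systems.

```python
# Sketch of guardrails applied outside the model: one check runs on the
# input before generation, another on the output after it. The topic
# list and redaction pattern are invented placeholders.

import re

BLOCKED_TOPICS = {"weapon design"}
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.\w+")

def check_input(text):
    """Block requests in disallowed categories before they reach the model."""
    return not any(topic in text.lower() for topic in BLOCKED_TOPICS)

def redact_output(text):
    """Redact sensitive strings (here, email addresses) after generation."""
    return EMAIL.sub("[REDACTED]", text)

print(check_input("Summarize this cable"))                # True (allowed through)
print(check_input("Help with weapon design"))             # False (blocked)
print(redact_output("Contact envoy@example.gov today."))  # Contact [REDACTED] today.
```

Because these checks are separate from the model, they can be logged, audited, and updated without retraining anything, which is why an end-to-end system is more than the model at its core.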

  

Where AI Is Used Today in Diplomatic and Security Contexts 

AI is already embedded in many analytical and operational systems across government. It supports, but does not replace, human judgment. Below are some examples organized by the underlying approach. 

 

1. Classical machine learning in practice 

  • Cargo Classification Tool (DHS-94) 

This tool automatically matches text descriptions of products to tariff codes using classification. It speeds up and standardizes cargo classification, improving both efficiency and accuracy. 

  • Violence Against Civilians Model (Department of State) 

This model uses open-source political, economic, and social data to estimate the risk of mass killings of civilians in each country. It supports the Bureau of Conflict and Stabilization Operations with resource allocation in early warning and conflict prevention. 

  • Advanced Network Anomaly Alerting (DHS-105) 

This system scans large volumes of network traffic and flags unusual patterns that might indicate cyber intrusions or attacks. It does not, by itself, confirm an attack, but it focuses human attention on suspicious activity. 

 

2. Deep learning in practice 

  • AI-Assisted Translation (Department of State) 

Machine learning tools draft translations. Because these models can make subtle errors and may reflect biases in training data, human translators always review outputs. 

  • Automated Damage Assessments (Department of State; retired) 

This program used high-resolution satellite imagery and deep learning to automatically assess damage in conflict zones, helping to inform response and accountability efforts. 

  • Deepfake Detector (Department of State) 

This tool attempted to distinguish real human faces from AI-generated synthetic ones. It supported efforts to identify manipulated media and protect information integrity. 

 

3. Symbolic reasoning in practice 

  • FOIA 360 AI Matching Tool (Department of State) 

This system uses rule-based matching to improve efficiency in Freedom of Information Act processes, such as reducing duplicate work and routing similar requests together. 

 

Key limitations  

Even as AI becomes more capable, several important limitations remain, particularly in diplomacy, where situations are highly complex, diverse, and ambiguous. AI should be understood as a support tool: 

  • It can help structure information, find patterns, and generate options;
  • It can increase speed and scale in monitoring and analysis; and
  • It cannot yet wholly replace human intuition, ethics, relationships, and judgment. 

 

AI systems, including advanced models, currently: 

  • Do not understand politics or strategy in a human sense. They detect patterns in data.
  • Do not reliably capture human values, ethical commitments, or political red lines unless these are explicitly encoded or consistently reflected in their training data.
  • Do not possess novel strategic foresight: They extrapolate from patterns rather than genuinely anticipating novel moves in a highly adversarial environment. Humans do not possess such foresight either, but model outputs should not be mistaken for statements of truth.  

 

There are also some subtleties in diplomatic contexts. AI models still: 

  • Struggle with subtle political cues, shifting alliances, and informal power structures.
  • Struggle with cultural nuance, indirect communication, and unspoken signals. 

 

Many of these shortcomings exist in the models themselves, the mathematical “black boxes” that produce a response, but many teams build scaffolding to support the models in their failure modes. For example, though a model can’t keep track of the date or large amounts of information on its own, we can give it access to databases to prevent it from making up, or “hallucinating,” information.  

 

Governance and Oversight in Diplomatic Use 

Using AI in diplomatic and security settings raises specific governance questions. Three areas are especially important: testing, transparency, and security.  

 

1. Testing: Before deployment, systems need robust testing to answer questions such as: 

  • Under what conditions does the system perform well, and where does it fail?
  • How often does it produce incorrect or misleading outputs?
  • How does performance change across different regions, languages, or populations?
  • Can adversaries manipulate the system (for example, by feeding it misleading data or probing it for sensitive information)? 

 

Ongoing AI tool evaluation is essential, because systems may drift over time as data and environments change. 

 

2. Transparency: Diplomats, partners, and citizens need to understand: 

  • What the system is for (intended use).
  • What data it was trained on, at least at a high level (types of data, sources, time ranges).
  • What its known limitations are (e.g., weaker performance for certain regions, languages, or groups).
  • Who is responsible for decisions influenced by the system’s output. 

 

Clarity on these points helps assign responsibility, build trust, and avoid overreliance on tools that were never designed to function beyond their defined scope. 

 

3. Security: Governments must ensure AI systems: 

  • Show acceptable levels of bias for their deployment context.
  • Are resilient to misuse.
  • Protect sensitive and classified information, avoiding data leaks and unauthorized access.
  • Are deployed with clear rules of engagement and monitoring for unintended consequences. 

 

As diplomacy integrates AI tools, maintaining public and international confidence will require: 

  • Guarding against misinformation and deepfakes.
  • Ensuring automated tools do not inadvertently escalate tensions.
  • Demonstrating responsible, transparent use in both domestic and international forums. 

 

Looking Ahead 

Several trends are likely to shape the future of AI in diplomacy and security: 

  • Interoperable systems: This means that models and AI platforms will increasingly be able to interact, pass information between one another, and coordinate tasks.
  • Multimodal systems: This refers to systems that combine text, images, audio, video, and structured data within a single model.  

 

AI will likely become a standard part of the diplomatic toolkit, with powerful applications for analysis and coordination, but dependent on clear human leadership to ensure it is aligned with political objectives, legal obligations, and ethical commitments.