Designing effective GenAI agents requires marrying AI capabilities with established pedagogical principles and interaction design frameworks. In this section, we outline key design principles for LLM-powered educational agents, including transparency of AI decisions, scaffolded questioning techniques, multimodal engagement, and maintenance of learner agency. We then examine how these principles are implemented in practice, referencing frameworks like ARCHED and the structured prompt templates (often JSON-based) used in the Socratic Playground system (see Box 3.1). By emphasising features such as conversational pacing, synchrony between verbal and non-verbal cues, and interactive learner controls, we showcase how GenAI agents can balance open-ended dialogue with instructional rigour, thereby fostering learner trust and autonomy even while the interaction is AI-driven.


As AI tutors become more complex, transparency in their operation is crucial for building trust with both learners and educators. The ARCHED framework proposes a human-centred approach that embeds transparency and human oversight into AI-assisted instructional design (Li et al., 2025[21]) (see Box 3.1). Within the framework, multiple specialised AI components recommend pedagogical actions and evaluate them, while human educators remain the ultimate decision-makers, ensuring the reasoning behind AI-generated content is visible and can be vetted. Translating this to a digital tutoring agent scenario, a GenAI agent should ideally be able to explain why a particular question is asked or why certain feedback is given – or at least do so if the learner inquires. For example, an agent might preface a hint by explaining that it is intended to support clarification of the learner’s understanding of a specific concept. Such meta-dialogue provides insight into the agent’s pedagogical intent. Another aspect of transparency is indicating uncertainty. If the AI is not fully confident in a response (which can be estimated from model probabilities or a validation step), it can disclose that uncertainty – e.g. “Let’s double-check this answer, as I’m not entirely sure”. This honesty can help set the right expectations and invite joint problem-solving, rather than the student taking every AI statement as gospel. Several modern systems have introduced mechanisms to promote their transparency and explainability (see Box 3.2). As an alternative to the mechanisms used in existing tools, generative pedagogical agents can also
incorporate mechanisms for post-hoc validation of interactions. In designing SPL, the developers introduced a
logging and visualisation tool for researchers and instructors that showed the agent’s decision path (e.g. which
prompt pattern was triggered; what the agent “thought” the student’s misconception was). While this backend
transparency is not directly available to learners, it allows continuous human oversight of the AI’s pedagogical
actions. Overall, adopting a transparent design entails making both the system’s internal reasoning and its external
interactions as interpretable as possible, aligning with calls for trustworthy AI in education.
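The uncertainty-disclosure behaviour described above can be sketched as a small post-processing step. The function below is a minimal illustration, assuming the tutoring backend exposes per-token log-probabilities for its generated answer; the confidence threshold and the hedging phrase are hypothetical choices rather than features of any cited system.

```python
import math

def hedge_if_uncertain(answer: str, token_logprobs: list[float],
                       threshold: float = 0.80) -> str:
    """Prefix the agent's answer with a verbal hedge when the mean token
    probability falls below a (hypothetical) confidence threshold."""
    if not token_logprobs:
        return answer
    # Geometric-mean probability of the generated tokens.
    mean_prob = math.exp(sum(token_logprobs) / len(token_logprobs))
    if mean_prob < threshold:
        return ("Let's double-check this answer, as I'm not entirely sure: "
                + answer)
    return answer
```

In a production system, a low confidence score might instead trigger a separate validation step before the answer is shown to the learner.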


A cornerstone of AI-powered pedagogical agent design is the use of scaffolded dialogue, often drawing
from Socratic questioning and related strategies. Rather than delivering answers outright, a well-designed
AI-powered pedagogical agent guides the learner to construct knowledge through carefully sequenced questions.
This approach is rooted in Vygotskian scaffolding and the Zone of Proximal Development, where support is provided
just beyond the learner’s current ability and gradually withdrawn as competence grows (Vygotsky, 1978[11]). LLM-based agents are particularly well-suited to implementing Socratic questioning, as they can generate an extensive
range of probing questions and follow-ups dynamically. They can also flexibly rephrase or adjust the difficulty of
questions based on learner responses.
Frameworks for intelligent tutoring often include taxonomies of questions (e.g. conceptual probes, evidence requests,
counterfactual prompts) that can be encoded into the AI’s prompt or decision logic. In practice, SPL uses a JSON-based prompt template to enforce a structured tutoring script while still leveraging generative flexibility (Hu, Xu
and Graesser, 2025[4]). The prompt is divided into sections, such as “Initial_Interaction”, “Following_Up”, “Providing_
Feedback”, etc., each containing guidance for the type of Socratic moves the agent should make. For example, in a
“Following_Up” turn, the agent might be instructed (via the prompt) to ask a “why” question related to the last student
statement, or to request clarification if the student’s answer was incomplete. By structuring the interaction in this
way, the agent’s generative outputs remain pedagogically purposeful. More importantly, the JSON structure also
allows the system to track expectations and misconceptions explicitly: the agent keeps lists of the key points
(“expectations”) the student should mention in an ideal answer as well as known common errors (“misconceptions”).
Each student response is compared (via the LLM or supplementary classifiers) against these lists, and the subsequent
prompt is generated accordingly – e.g. if a misconception is detected, the following question might target that
misunderstanding. This method, inspired by AutoTutor’s expectation-misconception tailoring (Graesser et al., 2005[1])
but modernised with LLM capabilities, ensures the question scaffolding is adaptive to the learner’s input. Empirical
studies have long shown the effectiveness of such scaffolding as it keeps the learner in an active constructive mode
rather than a passive one, which is known to enhance learning outcomes (Chi, 2009[27]).
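The structured template and expectation-misconception tracking described above might be sketched as follows. This is an illustrative reconstruction, not SPL’s actual schema: the section names echo those quoted in the text, while the field names, example content and matching logic are hypothetical stand-ins.

```python
# Illustrative JSON-style prompt template, loosely mirroring the section
# names quoted above; the fields and example content are hypothetical.
TEMPLATE = {
    "Initial_Interaction": "Greet the learner and ask an opening question.",
    "Following_Up": ("Ask a 'why' question about the last student "
                     "statement, or request clarification if incomplete."),
    "Providing_Feedback": "Give brief, encouraging feedback.",
    "expectations": ["subsidies attract investment", "environmental benefit"],
    "misconceptions": ["subsidies always distort markets"],
}

def next_prompt(student_answer: str) -> str:
    """Choose the next tutoring move by comparing the student's answer
    against the expectation and misconception lists. Crude word-overlap
    matching stands in for the LLM or classifier a real system would use."""
    answer = student_answer.lower()
    for m in TEMPLATE["misconceptions"]:
        if len(set(m.split()) & set(answer.split())) >= 2:
            return f"Address the misconception: '{m}'."
    missing = [e for e in TEMPLATE["expectations"] if e not in answer]
    if missing:
        return f"Probe the unmet expectation: '{missing[0]}'."
    return TEMPLATE["Providing_Feedback"]
```

In SPL itself, the comparison against expectations and misconceptions is performed by the LLM or supplementary classifiers rather than by keyword overlap.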
Adopting the scaffolding approach in agent design aligns with a broader body of research that aims to leverage
LLM-driven agents to foster deeper understanding and self-directed learning (Córdova-Esparza, 2025[5]). In
designing a GenAI agent, educational technology developers should thus curate a bank of pedagogically sound
questioning strategies and incorporate them either through prompt patterns, few-shot exemplars, or rule-based
overlays on the LLM’s output.

To truly advance beyond single-modality interactions enabled by traditional AI tutors, GenAI agents can leverage
multimodal engagement – combining text or speech with other modalities like visuals, gestures, or interactive
simulations. Research in multimedia learning has shown that well-coordinated verbal and visual information can
enhance understanding, as long as they are synchronous and not overwhelming. Modern AI
platforms allow a tutoring agent to display images, diagrams, or even manipulate virtual objects in a simulated
environment alongside the dialogue. For example, if a student is learning geometry, the agent might dynamically
generate a diagram of a triangle and mark angles as it guides the student through a proof. Generative models can
produce descriptions of visuals or request relevant images (via integration with image search or generation models),
effectively acting as a bridge between text and visuals. Furthermore, if the agent is instantiated as a virtual tutor –
whether through AR/VR or screen-based interfaces – the alignment of facial expressions and gestures with dialogue
constitutes an important factor in achieving natural interaction. A nod or encouraging smile rendered on the AI
tutor’s avatar can reinforce the tone of the agent’s message (e.g. affirming the learner’s progress). Yet, it is worth
emphasising that the timing of these cues should align with the conversational content to avoid cognitive dissonance.
The Socratic Playground’s current implementation is primarily text-based with a simple animated avatar representing
the AI pedagogical agent, but the design guidelines call for gesture-text synchrony in future versions – for instance,
having the avatar produce a “thinking” expression when posing a difficult question, or a cheerful expression when
giving positive feedback. The literature on embodied conversational agents suggests that
such non-verbal behaviours, when congruent with the dialogue, can increase learner engagement and trust in the
agent. Non-verbal behaviours are used in several innovative platforms: the DALverse project establishes an inclusive
metaverse environment for distance education, where students can engage as digital avatars in multimodal learning
tasks, leading to increased engagement and retention in distance learning settings. The design implications are clear: GenAI agents should, whenever possible, be integrated into interfaces that leverage
multiple modalities (e.g. text, voice, graphics) to enable richer learning interactions. However, designers must adhere
to established principles of multimedia learning to ensure that these modalities complement rather than compete
with one another – for example, by avoiding extraneous animations or redundant narration that merely reads on-screen text aloud, both of which can contribute to cognitive overload.

A frequent criticism of AI tutors is the risk of learner passivity – if the agent does too much, students might become
disengaged or over-reliant on the AI. Therefore, a central design principle is the preservation and promotion of
learner agency. GenAI agents can support this in several ways. One approach is through developing learners’
metacognitive awareness. This may be achieved by, for example, posing open-ended questions that allow learners
to guide the direction of the interaction, thereby fostering their capacity to steer their own learning journey. Even
simple prompts, such as “Would you like another hint or should we try a different problem?”, position learners in an
active decision-making role. Interfaces can further enhance agency through interactive controls. For instance, the
SPL’s interface offers learners options to request a simpler explanation, pose a question to the agent, or indicate
that they wish to attempt the solution independently. These controls act as safety valves so the student can modulate
the help level. Under the hood, the agent monitors these inputs and adjusts its strategy – if a student repeatedly asks
for simpler explanations, the agent will reduce the complexity of its language or break problems into smaller steps;
if the student wants to proceed independently, the agent will step back and take on a more observational role, ready
to jump in only if asked.
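The control-driven modulation of help level described above can be illustrated with a small state holder; the event names and the two-request threshold are hypothetical, not SPL’s actual interface contract.

```python
class HelpModulator:
    """Track learner control inputs and derive the agent's help level.
    Event names ('simpler', 'independent', 'ask_agent') are hypothetical."""

    def __init__(self):
        self.simpler_requests = 0
        self.mode = "guide"        # 'guide' or 'observe'

    def on_event(self, event: str) -> None:
        if event == "simpler":
            self.simpler_requests += 1
        elif event == "independent":
            self.mode = "observe"  # step back, intervene only if asked
        elif event == "ask_agent":
            self.mode = "guide"

    def strategy(self) -> str:
        if self.mode == "observe":
            return "observe silently; answer only direct questions"
        if self.simpler_requests >= 2:
            # Repeated requests: reduce language complexity, smaller steps.
            return "simplify language and break problem into smaller steps"
        return "standard Socratic questioning"
```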
Another technique to maintain agency is to implement turn-taking policies that ensure the AI does not dominate
the dialogue. For instance, after the agent poses a question, it should give the learner ample time to think and
respond, rather than immediately filling silence with more talk. If a student seems stuck, the agent may offer a
hint, but ideally after encouraging the student to articulate any partial thinking first. This aligns with the AI tutoring
technique of offering minimal help to keep the student doing as much cognitive work as possible: the goal of such
systems is to reach an “interactive” level of engagement, where the student and tutor co-construct knowledge. From a design perspective, it can be useful to measure the proportion of conversation
led by the student vs. the agent; some prototype evaluations of SPL looked at what percentage of words or turns
were student-generated and aimed to maximise that over time through interface tweaks. Furthermore, the agent can
foster agency by being explicitly reflective: encouraging students to set goals, ask their own questions, or evaluate the
agent’s suggestions. For example, the agent might say, “Do you agree with the approach I just suggested, or do you
think there’s a better way?” – prompting the learner to critically assess the AI’s previous responses, thereby treating
the student as an active participant with agency, not just a recipient of knowledge.
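Measuring the proportion of conversation led by the student, as in the SPL prototype evaluations mentioned above, reduces to simple transcript accounting. The sketch below assumes a hypothetical transcript format of (speaker, utterance) pairs.

```python
def student_share(transcript: list[tuple[str, str]]) -> dict[str, float]:
    """Fraction of turns and of words contributed by the student in a
    transcript of (speaker, utterance) pairs ('student' or 'agent')."""
    student_turns = sum(1 for s, _ in transcript if s == "student")
    student_words = sum(len(t.split()) for s, t in transcript if s == "student")
    total_words = sum(len(t.split()) for _, t in transcript)
    return {
        "turn_share": student_turns / len(transcript),
        "word_share": student_words / total_words,
    }
```

Tracked over successive sessions, these shares give a simple indicator of whether interface tweaks are shifting the dialogue toward the learner.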

The design of generative pedagogical agents must carefully blend AI innovation with human-centric educational
principles. Frameworks like ARCHED provide macro-level guidance on maintaining transparency and human control in
AI-powered education systems (Li et al., 2025[21]). The micro-level design in agentic systems such as SPL (Hu, Xu and
Graesser, 2025[4]) illustrates concrete features that enact those principles (for example, structured prompts, adaptive
questioning, interface controls). By prioritising explainability, scaffolded interaction, multimodal engagement, and
learner agency, designers can create AI tutors that are not only powerful and adaptive, but also pedagogically sound
and user-friendly. As subsequent sections will show, these design considerations play a critical role in addressing
the challenges and ethical implications of generative AI tutors, ensuring that technology serves as a complement to
effective teaching rather than a detour from it.

To ground the discussion in a concrete example, this section introduces the operational Socratic Playground (SPL)
prototype and examines how it functions in real educational scenarios. The SPL system1 serves as a demonstration
of generative Socratic tutoring in action (Figure 3.1), focusing initially on the domain of essay writing and critical
thinking. This section will describe a typical user experience with SPL, summarise preliminary evaluation data on its
effectiveness, and discuss practical challenges encountered during deployment. The lessons learned from SPL’s pilot
use – including user feedback and technical issues like latency and hallucinations – highlight the gap that can exist
between research promise and practical deployment, offering valuable insights for future improvements.

In the piloting scenario with SPL, learners were asked to compose a short argumentative essay. For example, one
prompt might be “Should renewable energy be subsidised by the government?” Instead of grading the essay outright
or providing a static set of comments, the SPL agent engages the learner in a multi-turn Socratic dialogue about
their essay (Figure 3.2). The session typically begins with the agent greeting the user and asking to see their draft or
initial ideas. Suppose the student writes a few sentences stating their position. The agent will analyse this input (via
GPT-4 and the underlying prompt structure) and then respond with a thoughtful question – often a “why” or “how”
question – aimed at deepening the student’s argument. For instance, if the student asserted “Yes, renewable energy
should be subsidised because it’s better for the environment”, the agent might ask, “Why do you think government
subsidies are necessary for environmental benefits, as opposed to letting the market handle it?” This kind of why-question scaffold pushes the student to elaborate on their reasoning. The student then responds, perhaps adding
that “without subsidies, renewable projects might not attract investment”. The agent continues this process, maybe
following up with another prompt like, “Can you think of a specific example or evidence that supports that point?”
Through such iterative questioning, the student is led to flesh out their argument with reasoning and evidence,
essentially engaging in critical thinking about their own writing.
One notable observation from the SPL demonstration is that students often improve the quality of their
reflections and explanations during these dialogues. Preliminary data collected from pilot users (university
students in a writing workshop) suggest that after interacting with the Socratic agent, the students’ final
essays included more justification for claims and considered counterarguments more frequently than their
initial drafts. While this is not a controlled study result, it aligns with the expectation that prompting learners
to explain and justify would yield deeper engagement with the material.
The GenAI agent essentially acts as a catalyst for self-explanation, a well-known mechanism for learning gains. Users reported that the agent’s questions made them think more critically: one participant noted “The AI
asked me things I hadn’t considered, like how exactly the subsidies work. It was challenging but it made my argument
better.” This anecdotal feedback resonates with our goals – the agent is not providing direct answers but improving
the learner’s thought process and output.

The SPL system also demonstrates personalisation by adapting to different users’ needs within the essay task. For a learner who is struggling to generate content (Figure 3.3), the agent takes on a more supportive, even slightly leading role. It might break down the task: “Let’s start by outlining two main reasons you support subsidies. What’s one reason?” If the student is totally stuck, the agent can even offer a gentle nudge like, “One reason might be related to climate change – do you want to expand on that or think of another reason?” On the other hand, for a confident learner who writes a strong initial paragraph, the agent switches to a more challenging role – perhaps by introducing a counterpoint: “Some critics argue subsidies distort the market. How would you respond to that counter-argument in your essay?” This not only personalises by difficulty but also by role: with the less confident student, the agent was a coach breaking down the task, whereas with the advanced student, it became a debate partner injecting opposing views. The underlying mechanism enabling this is the continuous profiling of student performance and the flexible prompt template described earlier. Essentially, after each student response, the system classifies how well the student is doing and decides on a strategy (simplify vs. challenge, new question vs. give hint). This can be thought of as following the student’s Zone of Proximal Development – always trying to operate just above the current level. By observing features like the length and substance of the student’s answers, the system adjusts its scaffolding depth. Early user sessions showed that this adaptive behaviour could lead to deeper engagement: students at earlier stages of mastery benefit from tasks broken into manageable steps, while more advanced students remain engaged through progressively complex problems.
Users in a professional development workshop who tried SPL noted that the agent felt “attentive”, an impression due to this adaptive behaviour.
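The simplify-versus-challenge decision described above could be approximated from surface features of the student’s answer. The sketch below is a deliberately crude stand-in for SPL’s continuous profiling; the thresholds and feature choices are hypothetical.

```python
def choose_strategy(answer: str, expected_points: list[str]) -> str:
    """Pick the next tutoring move from surface features of the answer:
    its length and its coverage of expected points. The thresholds are
    illustrative, not SPL's actual profiling model."""
    words = len(answer.split())
    covered = sum(1 for p in expected_points if p in answer.lower())
    if words < 8 or covered == 0:
        # Struggling learner: break the task down, offer a gentle nudge.
        return "simplify: outline sub-steps and offer a leading hint"
    if covered >= len(expected_points):
        # Confident learner: switch to a challenger / debate-partner role.
        return "challenge: introduce a counter-argument"
    return "probe: ask for evidence on the missing point"
```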


Although comprehensive efficacy studies are still to be done (and this chapter later suggests a framework for
such studies), initial trials of SPL have been encouraging. In a pilot at the Hong Kong Polytechnic University
involving about 20 adult learners (teachers and professionals exploring GenAI tools), participants reported
high levels of satisfaction with the agent’s usefulness. A post-session survey using a 5-point Likert scale found
that the majority agreed (4 or 5) that the AI tutor’s questions helped them think more critically about the topic.
Many also indicated they would like to use such a tool regularly for brainstorming or refining their writing.
From the system’s perspective, logs showed that the dialogues often went through 8 to 12 turns of Q&A, with the
learner contributing increasingly complex answers. Linguistic analysis of the student responses from their first turn
to the final turn in each session indicated an increase in lexical diversity and sentence complexity, which can be seen
as proxies for richer content (though more rigorous content analysis is needed). These observations align with the
idea that Socratic generative tutoring fosters deeper reflection and engagement, resulting in improved outcomes
like more thoughtful essays. Furthermore, previous findings suggest that a GPT-4 powered Socratic tutor (like SPL)
facilitates more effective tutoring interactions than legacy intelligent tutoring systems (ITS), a result qualitatively and empirically
reinforced by our hands-on trials. However, the SPL demonstration also surfaced significant challenges, offering a reality check on the hype surrounding
AI tutors. One practical issue was GPT-4’s latency, as anticipated. In some cases, users had to wait 5 to 10 seconds
for the AI tutor to respond, particularly as the prompt context increased (for instance, after the student had written
a few hundred words of essay, sending all that plus the JSON structure to the model could prolong the response
time). While many users were patient and understood this was a prototype, a few found the waiting time disruptive,
especially when the conversation was flowing and then had a pause. This underlines a need for further optimisation
or perhaps using a faster model for intermediate turns. Another issue observed was occasional hallucinations by
the AI tutor. For example, in one session the student mentioned Germany’s renewable energy policy, and the agent
responded with a detailed “reminder” about an apparent German law that was not actually real – it had fabricated a
plausible-sounding fact. The student was savvy enough to question it (“I’m not sure that law exists”) and the agent
then backtracked, but this could mislead less knowledgeable learners. We have since added additional checks when
the agent provides factual statements, but it is a stark reminder that even a pedagogically well-intentioned AI can
introduce misinformation. Ensuring factual accuracy remains an ongoing battle in generative AI tutoring, reinforcing
arguments in the literature about the need for grounding and verification.

The SPL pilot also highlighted some interface issues. For instance, the initial version did not make it obvious that the user could ask the agent questions at any time; some users thought they could only answer the agent’s questions. This one-sided interaction was not the intention – the system was capable of handling user-initiated questions or clarifications, but the UI cues were not clear. In response, we adjusted the interface to clarify that users can ask the AI tutor for explanations or request a hint. Another minor but interesting observation was that some users interacted with the AI tutor in a formal tone at the beginning of their conversation (for example, “Dear tutor, I have a question…”). Over time they became more conversational as they realised the agent responds like a human would. This acclimatisation suggests that building user trust and familiarity with the agent’s style is part of adoption; any deployment should consider an onboarding or tutorial that lets users get comfortable with talking to the AI. From a design perspective, we found that maintaining a user’s sense of control was vital. When a participant disagreed with the AI’s suggestion, the AI persisted in pressing its point with the intention to be thorough, which left the user feeling frustrated. In subsequent tweaks, we had the agent explicitly acknowledge and respect the user’s viewpoints more (for example, “That’s a valid perspective. Shall we explore it further or do you want to consider alternative angles?”). This preserves the pedagogical goal of reflection but avoids the impression of the AI insisting on its way. Such fine-tuning makes the agent more like a supportive guide than an interrogator, which is important for sustained engagement.
The Socratic Playground demonstration provides a valuable case study in the real-world implementation of a
generative pedagogical agent. It points to several potential benefits that are emerging in discussions of
LLM-driven tutors (personalised scaffolding, improved critical thinking in student work, and positive learner reception). At the same time, it unearths the pragmatic issues that
arise when moving from controlled development to practical use: latency, occasional AI errors, interface clarity, and
the delicate balance of control between student and agent. The experiences from SPL underscore a central theme of
this chapter: there remains a gap between the research promise of
GenAI in education and the practical deployment of these tools, which can only be closed through iterative refinement, user-centred design, and rigorous evaluation.

The integration of GenAI agents into education has moved beyond theoretical promise to rigorous empirical
validation. In recent years, some randomised controlled trials and large-scale field studies have provided valuable
insights into the current efficacy of ITS powered by GenAI. In this section, we synthesise emerging evidence across
three distinct deployment models: hybrid/human-in-the-loop (augmenting human tutors), independent tutoring
(replacing or supplementing lectures), and classroom integration (supporting real-time classwork), and conclude
with a streamlined framework (see Box 3.3) for evaluating the efficacy of such systems.

A first scenario for the use of GenAI systems is that the AI does not teach the student directly but acts as a real-time “whisperer” for a human tutor, suggesting pedagogical moves to enhance instruction. The most prominent example is Tutor CoPilot, deployed in a large-scale randomised controlled trial involving 900 human tutors and 1 800 high school students. The study found that while using the GenAI tutor improved student mastery by 4 percentage points on average, its true power lay in "levelling up" the workforce. Students paired with lower-rated or novice tutors who used the CoPilot saw learning gains of 9 percentage points compared to the control group, effectively closing the gap between novice and expert tutoring. Analysis of chat logs revealed the mechanism: the GenAI system successfully nudged inexperienced tutors away from simply giving answers and toward using expert scaffolding strategies, such as asking guiding questions. This suggests that one of the most effective uses of GenAI is not to replace humans, but to scale expert pedagogy across a variable workforce. Another scenario involves students interacting directly with an AI pedagogical agent to learn new concepts or accelerate their study, often outside standard classroom hours. At Harvard University, the Harvard Physics Tutor (a custom GPT-4 agent) was tested against a “gold standard” active learning classroom in a randomised crossover trial. The results were striking: students using the AI tutor achieved learning gains more than double those of the active learning group (effect size d≈0.73-1.3) and, crucially, spent significantly less time to reach that proficiency. This highlights the efficiency of “hyper-personalisation”, where the GenAI addresses specific misconceptions that a classroom teacher cannot individually address for every student simultaneously. 
Similarly, in the distance learning context, IU International University deployed Syntea, a GenAI teaching assistant, to over 10 000 students. The primary metric of success here was “learning velocity”: students using Syntea reduced the average time required to complete a course by 27% while maintaining exam performance. By acting as an always-available Socratic study partner, the GenAI agent removed the “wait time” for feedback, effectively accelerating the learning loop. In low resource settings, the text-based math tutor Rori demonstrated that high-fidelity interfaces are unnecessary for impact. Deployed via WhatsApp to over 1 000 students in Ghana, Rori produced significant math growth (effect size d=0.37) at a marginal cost of roughly $5 per student, proving that conversational AI can bridge the digital divide even on basic mobile infrastructure. Thirdly, GenAI tutors can be used alongside standard instruction for practice problems. In that case, current evidence points to a high risk of cognitive offloading if “guardrails” are absent. (Cognitive offloading is the act of using external tools or resources to reduce the mental effort required to perform a task or remember information.) A study involving nearly 1 000 high school math students compared a standard “GPT Base” model against a pedagogically engineered GPT Tutor. Students given unrestricted access to the “Base” model performed 48% better during practice but 17% worse on subsequent independent exams, a phenomenon termed the “Crutch Effect”. The students had learned to use the AI to bypass the cognitive struggle necessary for learning. The “GPT Tutor”, intentionally engineered to withhold direct answers and prompt for self-explanation, mitigated this harm but did not yield the artificial performance boost seen in the base group. Other classroom tools like Khanmigo (Khan Academy) have shown mixed quantitative results but strong qualitative benefits. 
While some trials of Khanmigo showed “no statistically significant difference” in short-term test scores compared to standard web search, students reported a significant reduction in “evaluation apprehension”. They felt safer asking “stupid questions” to the AI than to a teacher. In summary, given this landscape of heterogeneous outcomes – ranging from accelerated mastery to skill degradation – it is clear that efficacy is not inherent to the technology but dependent on implementation. More importantly, rigorous evaluation of the implemented tools is warranted to distinguish authentic learning gains from deceptive performance boosts. This calls for continuous and systematic evaluation of different uses of GenAI-powered tools for learning. Box 3.3 provides some ideas of measures for evaluating these tools.
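Several of the results above are reported as Cohen’s d effect sizes (e.g. d=0.37 for Rori). For evaluations of the kind suggested in Box 3.3, the pooled-standard-deviation form can be computed as follows; the treatment and control lists would hold assessment scores from the two groups.

```python
import statistics

def cohens_d(treatment: list[float], control: list[float]) -> float:
    """Cohen's d with pooled standard deviation: the standardised
    difference between treatment and control group means."""
    n1, n2 = len(treatment), len(control)
    s1, s2 = statistics.variance(treatment), statistics.variance(control)
    pooled_sd = (((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)) ** 0.5
    return (statistics.mean(treatment) - statistics.mean(control)) / pooled_sd
```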
Deploying GenAI agents to facilitate technology-enhanced tutoring brings specific challenges that must be
addressed to ensure these tools are responsible, equitable and educationally effective. While the potential for
personalised, adaptive tutoring is vast, the implementation must navigate technical limitations and pedagogical
risks.
A primary technical challenge in dialogue-based tutoring is the tendency of generative LLMs to “hallucinate”,
producing plausible but incorrect information. In a Socratic context, where the tutor leads the student through a
chain of reasoning, a false premise introduced by the AI can derail the entire learning process. If students
internalise these errors, the damage is significant. Studies have already observed students reproducing AI-introduced
errors in homework tasks. To mitigate this, systems increasingly employ Retrieval-Augmented Generation (RAG) to
ground AI responses in trusted corpora, such as textbooks. Additionally, fairness remains a critical concern. LLMs can
exhibit performance gaps across languages or dialects, potentially disadvantaging non-native speakers. Furthermore,
without careful calibration, an AI tutor might inadvertently favour specific cultural perspectives or arguments,
undermining the neutrality required for effective tutoring.
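The RAG grounding mentioned above works by prepending passages retrieved from a trusted corpus to the model prompt. The sketch below substitutes naive word overlap for the embedding-based retriever a real pipeline would use.

```python
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank corpus passages by word overlap with the query (a stand-in
    for embedding similarity in a real RAG pipeline)."""
    q = set(query.lower().split())
    scored = sorted(corpus, key=lambda p: len(q & set(p.lower().split())),
                    reverse=True)
    return scored[:k]

def grounded_prompt(question: str, corpus: list[str]) -> str:
    """Build a prompt that instructs the model to answer only from the
    retrieved textbook passages, reducing the room for hallucination."""
    passages = "\n".join(f"- {p}" for p in retrieve(question, corpus))
    return (f"Answer using ONLY these sources:\n{passages}\n"
            f"Question: {question}\n"
            f"If the sources are insufficient, say so.")
```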
Perhaps the most significant pedagogical challenge is maintaining the delicate balance between support and
independence. There is a valid concern that over-reliance on AI assistance may reduce mental effort and compromise
the depth of inquiry. If an AI tutor is too directive, or if the student passively
accepts the AI’s guidance, the metacognitive benefits of the method – self-evaluation and critical thinking – are lost.
Designers must thus ensure the AI empowers the learner rather than making them a passive recipient. This involves
transparency in why a question is asked and explicitly prompting students to verify information, fostering the
metacognitive development essential in the era of GenAI.
Consistent with the consensus in the field, AI tutors should be viewed as tools to augment, not replace, human
educators. The ethical safeguard for these AI tutors should be a “human-in-the-loop” approach, where teachers retain oversight of the AI’s guidance. Teachers must have the agency to determine when the AI is used – for
example, assigning it for preliminary homework discussions so class time can be reserved for deeper analysis. This requires distinct professional development to ensure teachers are literate in interpreting
AI outputs and intervening when the system’s logic drifts.
Beyond the specific pedagogical dynamics, the broader deployment of these agents requires strict adherence to
operational and ethical standards. As Luckin and Holmes argue, technological innovation must be
paired with ethical guardrails.
Key considerations include:
• Data Privacy and Governance: Systems must comply with regulations like the European Union’s GDPR or the United States’
FERPA. Since AI tutors collect deep behavioural data, strict anonymisation and access controls are required to protect
student privacy (Colonna, 2023[48]).
• Infrastructure and Equity: Deploying LLMs like GPT-OSS and Qwen3 is computationally expensive. To prevent a
digital divide where only wealthy institutions access high-quality GenAI-powered tutoring, strategies must include
subsidised access or the use of optimised, lower-cost models.
• Transparency and Trust: It is an ethical imperative to transparently label the agent as an AI. Users should be
informed of the system’s limitations – specifically its potential to hallucinate – to encourage critical evaluation rather
than blind trust.
The advent of generative AI in pedagogical agents is only the beginning of a broader transformation in educational
technology. This section outlines future directions and a research roadmap for advancing this field. It highlights
several promising avenues:
• Authoring tools and platforms for educators to easily create and customise AI-driven tutoring content without deep
technical knowledge;
• Multimodal GenAI agents that incorporate vision, speech, and possibly other sensory inputs to create more holistic
learning experiences;
• Multi-agent and collaborative AI systems, where multiple AI tutors or AI-student peers interact with each other and
learners to simulate group learning dynamics;
• Lifelong learning companions that accompany and support learners over extended periods (across courses or
years), adapting as the learner grows;
• Cross-context adaptive deployment, ensuring these agents can transition and be effective in varied contexts (from
formal classrooms to informal learning, across different subject domains and age groups).
We also propose future
research methodologies, including large-scale trials and longitudinal studies, to validate and refine the impact of
these systems.
Finally, we emphasise that GenAI agents are an evolving technology that will require continuous evaluation of
effectiveness, accessibility, and alignment with pedagogy as they develop.
First, for generative AI tutors to be most useful and widely adopted, educators need to be able to create and customise
content easily. Relying on AI experts to build every lesson is not scalable. Therefore, a crucial area of development
is teacher-facing authoring tools that leverage AI to help produce AI-driven lessons. For example, a teacher might
input the learning objectives and key points for a lesson, and the system could generate a draft prompt template or
a series of questions aligned with that objective. The teacher could then review, refine, and approve the AI-generated
content. Alternatively, a teacher could demonstrate a desired dialogue flow once – either by conversing with a mock
student or by outlining it explicitly, and the AI tutor could adopt that style. Additionally, AI could support the creation
of simulations or narrative-based learning activities. For instance, if a teacher requests a scenario in which a student
debates an AI acting as a historical figure on a given topic, the system could generate an initial script for subsequent
teacher revision. Such tools would dramatically lower the barrier to implementing customised AI tutoring across
different subjects and languages.
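The kind of structured artefact such an authoring tool might produce can be sketched as a JSON prompt template generated from a teacher's inputs. The field names below are illustrative only, not the Socratic Playground's actual schema.

```python
import json

# Hypothetical sketch of a teacher-facing authoring step: learning objectives
# in, a draft JSON prompt template out, for the teacher to review and approve.
# All field names and dialogue rules are invented for illustration.

def draft_template(topic: str, objectives: list[str], grade_level: str) -> dict:
    return {
        "role": "Socratic tutor",
        "topic": topic,
        "grade_level": grade_level,
        "objectives": objectives,
        "dialogue_rules": [
            "Ask guiding questions instead of giving answers directly.",
            "Give a hint only after two unsuccessful attempts.",
            "Explain the purpose of a question if the student asks.",
        ],
        "teacher_approved": False,  # the educator reviews, edits, then flips this flag
    }

template = draft_template(
    topic="Newton's second law",
    objectives=["Relate force, mass and acceleration",
                "Apply F = ma to word problems"],
    grade_level="middle school",
)
print(json.dumps(template, indent=2))
```

Keeping the template as reviewable data rather than opaque model behaviour is what preserves the teacher's final say over the AI's conduct.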
A research path here includes understanding how teachers conceptualise AI behaviour and designing interfaces
that map to their thinking (e.g. some might prefer a rule-based interface, others might want to give examples
and have the AI generalise – similar to programming by demonstration). Co-design with educators will be key;
early studies should involve teachers using prototype authoring tools and measure outcomes like how quickly
they can develop a new lesson, how effective that lesson is for students, and how comfortable the teachers
feel about the level of control and transparency in the AI’s resulting behaviour. The ARCHED framework is a
step in this direction as it structures AI involvement in instructional design with human oversight at each stage. Future research can build on ARCHED to apply similar principles to real-time tutoring content
creation.
One clear direction for future research is extending beyond text to multimodal interactions. Humans communicate
and learn through a rich mix of modalities – speech, gesture, writing, drawing, etc. Future AI tutors will likely support
student learning by doing the same. Already, models like GPT-4 have some multimodal capabilities (accepting image
inputs, for example), and research is ongoing on integrating visual understanding with language models. A future
AI-powered pedagogical agent might watch a student solve a physics problem on paper (via a camera), diagnose a
misconception from their written work or diagrams, and then provide verbal guidance. Or in a virtual lab, the agent
might observe how a student assembles a circuit or conducts a simulation and intervene at the right moment. Vision-enabled agents could check a student’s worked solution for errors by “seeing” it, much as a teacher would glance
at a notebook. Meanwhile, speech interfaces will allow more natural use in contexts where typing is inconvenient –
imagine language learners practicing conversation with an AI that not only speaks but also reads facial expressions
to gauge affect (e.g. confusion). Moreover, embodied agents in AR/VR could provide immersive tutoring – for instance, a holographic science tutor
that appears in an AR headset to guide a student through a chemistry experiment in a lab. Embodiment can leverage
the physical environment: for instance, in mixed reality, an agent can point to parts of a model or demonstrate with
virtual objects. Multimodal agents may be designed to enrich students’ learning experience, better aligning with
theories like Dale’s Cone of Experience and Kolb’s experiential learning cycle, which
emphasise learning by doing and experiencing. Early trials of such approaches (like the DALverse platform mentioned
earlier, integrating LLMs with a metaverse) have shown increased engagement and improved retention. The challenge for researchers is to seamlessly integrate modalities such that the AI can interpret and
generate multi-sensory data coherently. This may involve combining specialised models (for vision, for speech) with
LLMs, or training unified, multi-modal models. It also raises new questions: How does one evaluate learning in these
richer environments? How can one ensure that the added modalities truly improve learning and are not just gimmicks? These
will be crucial questions to answer as this line of research advances.
Nonetheless, a likely near-future scenario is a tutor that can speak and listen (already feasible with speech-to-text
and text-to-speech integration) and perhaps use simple graphics or diagrams on the fly (e.g. drawing a chart using
data provided in teacher-prepared curriculum materials). Ultimately, multi-modal generative agents aim to mimic
a human tutor not just in conversation, but in full instructional presence – writing, sketching, demonstrating, and
responding to non-verbal cues.
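The near-term spoken-turn pipeline described above can be sketched as three chained components. The functions below are stubs standing in for real services (an ASR model, an LLM API, a TTS engine); the names and canned outputs are purely illustrative.

```python
# Sketch of one spoken tutoring turn: speech-to-text, LLM reply, text-to-speech.
# Each component is a stub; a real deployment would swap in actual ASR, LLM
# and TTS services behind the same interfaces.

def transcribe(audio: bytes) -> str:          # ASR stand-in
    return "Why does the moon have phases?"

def tutor_reply(utterance: str) -> str:       # LLM stand-in
    return f"Good question. What do you already know about this: {utterance}"

def synthesise(text: str) -> bytes:           # TTS stand-in
    return text.encode("utf-8")

def spoken_turn(audio_in: bytes) -> bytes:
    """One full voice interaction: listen, reason, speak."""
    student_text = transcribe(audio_in)
    reply_text = tutor_reply(student_text)
    return synthesise(reply_text)

audio_out = spoken_turn(b"<raw audio>")
```

The value of structuring the turn this way is that each modality module can be upgraded independently as better vision, speech, or language models appear.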

Another exciting frontier is the use of multiple agents to enrich educational interactions. Rather than one AI tutor and one student, we could have scenarios with several AI characters and one or more students. For example, a multi-agent system might include an AI tutor plus an AI peer learner; the human student can then participate in a group dialogue. This could simulate collaborative problem-solving or Socratic debates, exposing students to diverse viewpoints. Park et al. successfully developed “generative agents” that interact with each other to simulate human-like social behaviour. Researchers have proposed
multi-agent frameworks (EduMAS) for educational support. One agent might propose a solution, another critique it, and the human student could be asked to arbitrate or contribute, thereby learning through a rich discussion. Alternatively,
multiple agents could take on specialised roles: one focusing on content hints, another on motivational encouragement, and a third maybe representing a historical figure or stakeholder in a debate (imagine learning civics by having AI agents role-play different political viewpoints in a discussion with the student). These multi-agent interactions, if well-orchestrated, can model productive dialogue patterns and expose learners
to argumentation and perspective-taking that is hard to achieve with a single tutor. However, designing multi-agent
systems is inherently complex as the systems must be designed in ways that preserve consistency and prevent
confusion (the agents must not overwhelm or contradict in a detrimental way). Research will need to explore optimal
designs: What is the ideal number of agents within a multi-agent system? What role combinations are effective? Some
early studies outside of education suggest multi-agent debates can improve answer accuracy by agents critiquing
each other, but in education the goal might be more to model peer discussion or provide
contrast.
Team Tutoring, where AI supports collaborative groups of students, is
related; the AI could moderate or participate in a student group discussion, making sure everyone contributes (this
uses multi-agent in the sense of AI + multiple humans). Overall, harnessing multiple AI agents to facilitate social
learning is a promising direction, aligning with socio-constructivist theories that knowledge is often built through
discourse.
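The orchestration problem at the heart of such systems can be sketched minimally as a round of role-differentiated agents responding before the student arbitrates. The canned responses stand in for LLM calls, and the role names are illustrative; this is not the actual EduMAS architecture.

```python
# Minimal sketch of orchestrating multiple pedagogical agents in one dialogue.
# Each agent is a (name, respond-function) pair; responses are canned stand-ins
# for per-role LLM calls with distinct system prompts.

def tutor(msg: str) -> str:
    return "Let's break the problem into steps. What is given?"

def peer(msg: str) -> str:
    return "I think we should try substituting the known values first."

def critic(msg: str) -> str:
    return "Before substituting, check that the units are consistent."

AGENTS = [("Tutor", tutor), ("Peer", peer), ("Critic", critic)]

def run_round(student_msg: str) -> list[tuple[str, str]]:
    """Collect one contribution per agent; the student then arbitrates."""
    return [(name, respond(student_msg)) for name, respond in AGENTS]

transcript = run_round("How do I solve F = ma for a?")
for speaker, line in transcript:
    print(f"{speaker}: {line}")
```

Even this toy loop surfaces the key design questions noted above: fixing a turn order keeps the agents from talking over one another, and distinct roles keep their contributions complementary rather than contradictory.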

Envisioning further ahead, generative AI agents could become continuous learning companions that support an
individual over years, across many domains. Instead of separate tutors for math, science, writing, etc., one AI (or an
integrated system) could accompany a student through their educational journey, maintaining a long-term model
of their interests, strengths, and weaknesses. This idea resonates with the concept of personal AI assistants and
also ties into educational initiatives for personalised lifelong learning (Krinkin, 2026 (forthcoming)[20]). For example,
a student’s “AI mentor” might help them in middle school algebra, then later adapt to help with high school physics,
remembering that the student struggled with calculus concepts and proactively reinforcing those when they appear
again. It could even extend beyond formal schooling: as the student goes to university or job training, the same AI
knowing their learning history could tailor new learning experiences effectively. Implementing this raises many questions – technical (how to store and update the long-term learner model securely),
pedagogical (how to ensure continuity leads to cumulative benefits and not compounding of earlier biases), and
ethical (ownership of that data, the right to “reset” or change one’s AI companion to avoid pigeonholing based on
early performance). Research by Tong and Hu on self-improving adaptive instructional systems, and by others on neuro-symbolic architectures (such as the NEOLAF AI service for education in the United States), tackles the idea of AI that can improve itself and adapt over time, which is related to building
a durable lifelong tutor. A roadmap for this could involve pilots where the same AI system is used across multiple
courses or grade transitions, observing if retention or transfer improves because the AI can remind the student of
previous knowledge or adapt to their cumulative profile.
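One ingredient of such a companion, the durable learner model, can be sketched as a mastery map updated after each session. The field names and the exponential-moving-average update rule are illustrative design choices, not a prescribed standard.

```python
# Sketch of a long-term learner model persisting across courses: per-concept
# mastery estimates, blended with new evidence after each session. The update
# rule (exponential moving average) and thresholds are illustrative.

def update_mastery(model: dict, concept: str, score: float, alpha: float = 0.3) -> dict:
    """Blend a new session score into the long-term mastery estimate."""
    mastery = model.setdefault("mastery", {})
    if concept in mastery:
        mastery[concept] = round((1 - alpha) * mastery[concept] + alpha * score, 3)
    else:
        mastery[concept] = score  # first observation seeds the estimate
    return model

def weak_concepts(model: dict, threshold: float = 0.5) -> list[str]:
    """Concepts the companion should proactively reinforce later."""
    return [c for c, m in model.get("mastery", {}).items() if m < threshold]

learner = {"id": "student-42", "mastery": {}}
for concept, score in [("fractions", 0.9),
                       ("calculus limits", 0.2),
                       ("calculus limits", 0.4)]:
    update_mastery(learner, concept, score)
```

The moving average makes the model forgiving: old struggles fade as new evidence arrives, which speaks directly to the ethical concern above about pigeonholing a learner based on early performance.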
A possibility is that a lifelong AI companion could also foster lifelong learning habits – by being present beyond the
classroom, it might encourage curiosity, recommend learning opportunities, or help with personal projects (like an AI
that helps a student interested in music by finding them educational resources or setting practice goals). Essentially,
it would blur the lines between formal and informal learning support.
Another future direction is ensuring that generative agents can adapt to various educational contexts easily. At
present, considerable effort is required to adapt systems to each new domain or use-case. In the long run, more
generalist AI tutors might rapidly adapt to and learn new content. Few-shot learning capabilities of LLMs are promising
here – perhaps a tutor can be given a single lesson text or some examples of questions and answers in a new subject and then operate as a tutor in that subject. The robustness across contexts also includes adapting to different
educational levels (e.g. using simpler language for a 5th grader versus a college student – something LLMs can do
to an extent by role prompting). It also includes cultural context adaptation: a global AI tutor might need to know
local curricula or examples relevant to the learner’s environment. Research might explore how to embed contextual
knowledge without a complete retraining – maybe by plugging in local knowledge bases or letting the system be
easily fine-tuned by local educators.
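The few-shot adaptation idea above can be sketched as assembling teacher-supplied Q&A pairs into a prompt that primes a general LLM for a new subject. The prompt wording and example content are invented for illustration.

```python
# Sketch of few-shot subject adaptation: a handful of teacher-written Q&A
# examples in a new domain become a prompt that sets the tutoring style.
# The wording and examples are illustrative.

def few_shot_prompt(subject: str, examples: list[tuple[str, str]], question: str) -> str:
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return (
        f"You are a tutor for {subject}. Follow the style of these examples.\n\n"
        f"{shots}\n\nQ: {question}\nA:"
    )

prompt = few_shot_prompt(
    "introductory genetics",
    [("What is an allele?",
      "Think of a gene as a recipe - what might a variant of it be?"),
     ("What does dominant mean?",
      "If two versions disagree, which one 'speaks louder'?")],
    "What is a genotype?",
)
```

Because the examples encode pedagogy (guiding questions rather than answers), the same mechanism can carry a Socratic style into a subject the system was never explicitly configured for.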
Additionally, cross-context use might involve the tutor being used in different settings: formal class vs. after-school vs.
workplace training. We should explore how it needs to adjust its style (formal vs. casual, directive vs. self-directed
learning mode) depending on context. The ultimate vision is an AI tutor framework that is as flexible as a human
teacher who can teach different subjects, age groups, and adapt teaching style – a distant goal, but research in
transfer learning and multi-domain training of AI models is making progress.
The evolution of pre-scripted,
AI-powered pedagogical agents towards sophisticated generative AI tutors represents a
profound shift in educational technology. This chapter has traced this transformation through the lens of the Socratic
Playground (SPL), demonstrating that the move from scripted tutor avatars to generative Socratic companions is
not merely a technological upgrade, but a reimagining of educational possibilities.
GenAI can function as a dynamic
conversational partner capable of adaptive guidance and deep dialogue, provided that pedagogy remains the core
driver. The “pedagogy-first” principle, and its integration into system design, is both an ethical imperative and the practical
key to success. In essence, AI tutors must be as much products of educational craftsmanship as of computational
prowess. Digital innovations in education hinge on the synergy of advanced AI with sound teaching approaches.
However, the rapid advancement of these systems necessitates a shift from proof-of-concept to rigorous, continuous
evaluation. As
GenAI agents become commonplace, the research community must move beyond novelty to conduct
large-scale randomised trials that examine holistic outcomes. It is crucial to probe not only subject knowledge gains
but also metacognitive shifts: do students develop better learning strategies, greater self-regulation, and sustained
interest, or does the motivation fade once the novelty of the AI wears off? Furthermore, as models inevitably evolve
(e.g. from GPT-4 to future iterations), the educational quality cannot be assumed to remain static. A robust process
is needed to re-validate agents with each major upgrade to ensure they remain aligned with learning goals rather
than becoming distractions.
This ongoing validation must prioritise inclusivity and ethics alongside effectiveness. Factual accuracy,
fairness, and privacy are not optional add-ons but fundamental to the integrity of
GenAI-powered tutors. To achieve
this, various safeguard mechanisms in system implementation could be adopted – ranging from bias audits and
human-in-the-loop frameworks to alignment with international guidelines like those from the OECD, UNESCO, and
the EU. By implementing such safeguards, the international community can strive to make GenAI pedagogical
agents not only effective but also trustworthy and inclusive. As features become more multimodal and autonomous,
these efforts must specifically target accessibility, ensuring accommodations for diverse learning needs. When
implemented carefully, these measures can foster trust: students may welcome the personalised support, and
teachers may appreciate the augmented capabilities, dispelling fears of AI as an unwanted intruder. Ultimately,
navigating these challenges requires deep interdisciplinary collaboration among AI researchers, learning scientists,
educators, policymakers, ethicists and other stakeholders, as no single group holds all the expertise needed to
perfect these systems.
In Art Graesser’s early work with AutoTutor, the dream was to simulate skilled tutoring
dialogue. Today,
GenAI and systems like the Socratic Playground bring us much closer to fulfilling those aspirations.
Yet, this technology should not be pursued for novelty’s sake, but to amplify and democratise the best of teaching
– enabling rich, adaptive mentorship for every student, regardless of geography. Realising this vision requires a
“pedagogy-first” ethos and the development and use of AI tools that augment rather than diminish human intellect.
By combining the empathy of teachers, the rigour of learning and education scientists, and the computational
consistency of AI, the education community can author a future where GenAI agents are harnessed responsibly to
close the learning divide and become a success story for learners everywhere.