Generative AI for human skill development and assessment: Implications for existing practices and new horizons.
Generative artificial intelligence (GenAI) is transforming the landscape of education by reshaping how
skills are developed, assessed, and supported. This section synthesises recent empirical evidence on
how GenAI technologies influence instructional practices, feedback, and assessment. It examines
both the opportunities and limitations of using GenAI to provide personalised tutoring, enhance
feedback quality, and automate assessment practices. The chapter argues for a careful balance
between human skill development and AI-augmented performance, emphasising the need for
pedagogically grounded integration of GenAI within intelligent tutoring and assessment frameworks.
It concludes by outlining directions for research and policy that ensure GenAI strengthens, rather
than substitutes, human learning and instructional expertise.
The wide adoption of generative artificial intelligence (GenAI) – after the public release of ChatGPT in November 2022
– has triggered profound debates about its implications for education. GenAI offers technologies that can
support skill acquisition through personalised instruction and feedback, and enhance the efficiency and effectiveness
of teaching practices. However, GenAI also poses ethical challenges and risks. These developments have prompted educators, education leaders,
and policymakers to engage extensively with GenAI and to rethink pedagogical, assessment, and governance frameworks
to harness GenAI’s potential while mitigating its risks. Through these efforts, many education institutions have
produced policies and guidelines to support staff and students in using generative AI. Similarly,
many government, intergovernmental, nongovernmental, and not-for-profit organisations have also produced
documents that inform GenAI adoption, responsible practices, and frameworks for professional development of
educators.
Equally, the rapid developments in GenAI have mobilised many researchers to study its implications for education
and human learning.
This section aims to summarise recent evidence about the implications of GenAI for human skill development and assessment. The focus will be on human skill development and assessment as they are central to education
and professional development programs. The analysis of the implications of GenAI on human skill development
and assessment is particularly framed around two complementary perspectives. First, GenAI technologies offer
some promising prospects for advancing our existing practices related to skill development and assessment.
For example, GenAI can be used to provide interactive instructional support, offer personalised feedback at scale, and automate the creation and implementation of assessments. Second, GenAI challenges existing assumptions about learning practices and calls for novel
approaches to assessment. For example, while GenAI can increase performance in certain situations, it can also limit
human agency and result in overreliance on AI. Finally, we need to strengthen the research methods used to study human skill development and
assessment in the age of GenAI to avoid challenges recently noted in the literature, such as the conflation of learning
and performance.
This section is based on the analysis of empirical evidence published in the research literature. It offers a summary
of the existing evidence about the effectiveness of GenAI in supporting existing practices for instruction, feedback, and
assessment given their central roles in education and professional development. It also describes the recent
conceptualisation of hybrid human-AI skills that recognises the need to support development of human skills while
enhancing task performance with the use of GenAI. The section concludes by providing implications for practice and
policy grounded in existing evidence and promising directions for future research. Box 2.1 provides a glossary of the
main terms and types of generative AI and associated techniques in the field of AI in education (AIED).
Providing enhanced instructional support at scale is one of the most prominent areas for the use of GenAI in
education. This is grounded in the idea of making use of GenAI for developing systems that can
offer personalised learning support. The idea of personalised learning support is grounded in Bloom’s “two-sigma problem”, which showed the significant benefits of one-to-one instruction over other forms of instruction. Before the
widespread use of GenAI, the effectiveness of personalised learning support had long been studied in the literature
on artificial intelligence in education, particularly focusing on intelligent
tutoring systems and resulted in the development of many
effective tutoring systems – e.g. SQL-Tutor, MetaTutor, and Cognitive Tutors. Especially relevant to today’s attempts to provide personalised
learning support are intelligent tutoring systems such as AutoTutor and BEETLE that were already designed to provide tutoring through dialogue in natural language. This
research also informed the development of many commercial tutoring systems such as MATHia based on Cognitive
Tutor and ALEKS. However, the rapid development of such systems remains a challenge.
GenAI offers promising approaches that can be used for rapid development of instructional systems for personalised
support. Specifically, GenAI, through the use of large language models, can be leveraged to develop tutoring chatbots.
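To make the idea concrete, below is a minimal, hypothetical sketch of how a Socratic tutoring constraint might be expressed as a system prompt for a large language model. The function name, constraint list, and example problem are illustrative assumptions, not the design of any production tutor such as Khanmigo.

```python
def build_socratic_prompt(subject: str, problem: str) -> str:
    """Assemble a system prompt that constrains an LLM to Socratic tutoring.

    The constraint list is illustrative; production tutors use far more
    elaborate prompt and safety layers.
    """
    constraints = [
        "Never give the final answer directly.",
        "Ask one guiding question at a time.",
        "Adapt hints to the student's last response.",
        f"Stay within the scope of {subject}.",
    ]
    return (
        "You are a patient Socratic tutor.\n"
        + "\n".join(f"- {c}" for c in constraints)
        + f"\n\nProblem under discussion: {problem}"
    )

# The resulting string would be passed as the system message of a chat model.
prompt = build_socratic_prompt("algebra", "Solve 2x + 3 = 11")
```

In practice, the quality of the tutoring depends on how reliably the underlying model complies with such constraints, an issue discussed later in this section.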
A prominent example is Khan Academy’s Khanmigo chatbot, which makes use of large language models to conduct
scaffolded, Socratic‐style tutoring across diverse subject areas. As one of many emerging GenAI
chatbots in education, Khanmigo illustrates how these technologies can scale personalised learning support and
expand opportunities for learner autonomy and exploration. However, at the time of writing this chapter, no published studies had evaluated the effectiveness of Khanmigo on learning (although at least one preregistered RCT is ongoing in Canada).
Evidence on the effectiveness of GenAI to enhance instructional support is still emerging and offers mixed support.
For example, a randomised controlled trial conducted at Harvard University showed significant effects (0.73–1.3
standard deviations) of an AI tutor – a ChatGPT-powered system – relative to in-person active learning
classes in an undergraduate physics course. The World Bank has recently reported the findings
of a randomised controlled trial in nine secondary schools in Nigeria. In the trial, students
were randomised into a treatment group that received access to Microsoft Copilot (based on GPT-4) in an after-school
programme and a control group that did not. Students in the treatment group received teacher
instruction on how to use Copilot, including prompts, and worked in pairs with other students. The results showed
positive effects of the intervention, with an effect size of 0.31 standard deviations. However, this effect size was
lower than the average effect size noted in the meta-analyses of the effectiveness for intelligent tutoring systems –
i.e. 0.42–0.57 standard deviations according to Ma et al. and 0.66 according to Kulik and Fletcher.
This suggests that past AI tutors might be more effective, albeit in different contexts. Nevertheless, the World
Bank study findings align with the range observed in promising computer-assisted learning interventions reviewed
by Escueta et al., who identified effect sizes between 0.18 and 0.63 standard deviations for personalised
and adaptive programs, particularly in mathematics. The World Bank study also showed that the students with high
prior academic performance and high socio-economic status particularly
benefited from the interventions. While these findings come from a context where socio-economic disparities are
likely more pronounced than in most OECD countries, they still suggest that GenAI-based tutoring systems may
disproportionately benefit certain groups of students. Future research in diverse educational settings is needed to
corroborate this pattern.
The way a GenAI-based tutoring system is configured and used may have profound implications for learning.
In a large-scale field experiment in high school math classrooms, Bastani et al. found that while
GPT-4-based tutors improved performance during use (up to 127%), students who used a standard chatbot akin to
ChatGPT performed worse (17% lower performance) than the control group once access to the chatbot was removed.
The control group students did not use any GenAI-based instructional support in addition to the conventional
classroom instruction. This negative effect of GenAI use was mitigated by a version designed with learning safeguards,
suggesting that poorly configured systems may undermine long-term learning.
Similarly, Lehmann et al. showed that a ChatGPT-based tutor for Python programming had no
overall effect on learning, but its impact depended on usage patterns. Students who heavily relied on the
ChatGPT-based tutor tended to cover a broader range of topics but developed shallower understanding, while
those who used it to complement learning gained deeper understanding. The use of the ChatGPT-based tutor also
widened performance gaps between students with high and low prior knowledge. In summary, the Bastani et al. and Lehmann et al. studies highlight that the instructional strategies embedded in
the design of GenAI-based instructional systems and the ways students use them are two key factors that need
to be considered in research and practice. Future research should investigate the effectiveness of the different
pedagogical approaches that GenAI-based instructional systems use and the factors (e.g. metacognitive skills) that
explain students' different usage patterns. It is also important to examine how conventional and generative AI models can be effectively integrated.
Such integration should enhance human learning by combining the strengths of each approach while minimising risks,
including hallucinations that may compromise reliability.
Effective instructional support from GenAI-powered systems requires alignment with the proven experience
of developing intelligent tutoring systems. Although intelligent tutoring systems are frequently referenced in recent developments,
GenAI-powered instructional systems do not follow their typical architecture. At the core of intelligent tutoring systems are learner models, tutor models, domain models, and a user interface. Most GenAI-based instructional systems primarily make use of large language
models to cover functions of all these four components. While user interfaces through natural language interaction
can be quite advanced with LLMs, the support for the other three components is less obvious. Although the
functions of the domain and tutor models can be performed by LLMs to some extent, there is presently limited
research and evidence on how their quality can be assured. Specifically, due to their stochastic nature, LLMs
cannot guarantee the reliability of information covered in the domain model because of their tendency to hallucinate
(Ji et al.). Existing research shows that LLMs can easily be distracted and comply inconsistently with the
instructions provided in the underlying prompts.
Therefore, future research is needed to assess the extent to which LLMs can consistently comply with a particular
tutoring strategy to offer long-term effects. Moreover, future work is needed to develop effective computational
approaches that can increase compliance of LLMs to promote effective tutoring strategies over time. Finally,
there is very little evidence in the literature that existing GenAI-based instructional systems offer any learner
models. Learner models are precisely what is needed to understand individual student needs – tracing students' knowledge
development and the learning approaches they take – in order to provide adaptive
and personalised support. Recent evidence by Borchers and Shou shows that LLM-only tools offer only
limited adaptivity compared to conventional intelligent tutoring systems. Future research is needed to address
these critical architectural needs and identify effective ways of integrating LLMs within tutoring architectures
to enhance instructional effectiveness.
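One possible direction for such integration can be sketched as follows: keep the domain and learner models as verifiable structures outside the LLM, and use the LLM only to phrase the tutoring turn. This is a minimal illustrative sketch under assumed component names and a toy selection policy, not an established architecture.

```python
# Curated, verified content plays the role of the domain model
# (deliberately NOT generated by the LLM, to avoid hallucinated facts).
DOMAIN_MODEL = {
    "loops": "A for-loop repeats a block once per item in a sequence.",
    "functions": "A function packages reusable logic behind a name.",
}

def tutor_model(learner_model: dict) -> str:
    """Toy tutoring policy: pick the least-mastered skill to practise next."""
    return min(learner_model, key=learner_model.get)

def build_turn(learner_model: dict) -> str:
    """Assemble an LLM prompt grounded in the domain and learner models."""
    skill = tutor_model(learner_model)
    fact = DOMAIN_MODEL[skill]  # grounding text the LLM must not contradict
    return (
        "Rephrase the following verified explanation as a friendly hint, "
        "without adding new facts:\n" + fact
    )

# Traced mastery estimates would come from a knowledge-tracing component.
learner = {"loops": 0.8, "functions": 0.3}
prompt = build_turn(learner)
```

The design choice illustrated here is that the LLM never decides *what* to teach or *whether* a fact is true; those decisions stay with inspectable components, which is one way the reliability concerns raised above might be mitigated.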
Enhancing instructional support does not necessarily need to be done through providing direct instruction to students.
Teachers can also be beneficiaries of GenAI for tasks related to preparation for teaching and during the actual act
of teaching. For example, GPTeach is an interactive teacher training tool that enables novice educators to practice
teaching with GPT-simulated students. Evaluations of GPTeach have shown that it can enhance teachers’ preparedness
and confidence, offering valuable practice opportunities tailored to varied teaching scenarios.
Relatedly, Tutor CoPilot is a GenAI-powered system that provides real-time, expert-like guidance to tutors during live
tutoring sessions. This approach is particularly relevant for supporting students by mobilising
a less experienced workforce and addressing the issue of teacher shortages. In a randomised controlled trial, Wang
et al. evaluated the effectiveness of Tutor CoPilot. The study involved 900 tutors and 1 800 K-12 students
from historically under-served communities. Results indicated that students whose tutors had access to Tutor CoPilot
were 4 percentage points more likely to master topics, with the most significant benefits observed among students
of lower-rated tutors, who experienced a 9 percentage point improvement. Additionally, tutors using Tutor CoPilot
were more inclined to employ high-quality pedagogical strategies, such as asking guiding questions, and less likely
to provide direct answers. Although the studies with GPTeach and Tutor CoPilot show much promise, future research
is needed to understand the uptake and effectiveness of such systems in diverse educational and international
contexts. Equally important is future research to understand how effectively tutoring practices supported by Tutor
CoPilot are internalised by teachers over time as part of their professional development, and whether they may lead
to overreliance on GenAI, potentially hindering the development of teachers’ human teaching skills.
GenAI can support teachers in a range of tasks, with mixed results regarding their effectiveness and efficiency.
Although lesson planning is frequently discussed as one of the key areas of teaching practice that can benefit
from the use of GenAI, evidence about its effectiveness is still
emerging. For example, Dennison et al. evaluated Shiksha Copilot, an AI-assisted lesson planning
tool deployed in schools in India. In a large-scale mixed-methods study, including interviews, surveys, and
usage logs, the study found that teachers used Shiksha Copilot to meet administrative documentation needs
and support their teaching. The use of the tool was associated with a reduction in lesson planning time, with
small to large effect sizes (Cohen’s d = 0.371–0.658), and lowered teaching-related stress (Cohen’s d = 0.436),
while promoting a shift toward activity-based pedagogy. However, systemic challenges, such as staffing shortages
and administrative demands, constrained broader pedagogical change. In contrast, Selwyn et al. conducted
interviews with teachers about their experiences with GenAI tools for administrative tasks in Sweden and Australia,
highlighting the significant work teachers self-report investing in reviewing, repairing, and reworking AI-generated
outputs. Their findings suggest that the promise of time-saving in AI tools may overlook the complex professional
judgments teachers must make regarding pedagogical appropriateness, social relations, and educational value.
However, Selwyn et al.’s findings are based on self-reports (i.e. interviews), which likely do not reliably
estimate time spent on technology use. Usage log analysis, on the other hand, offers a more
accurate and less biased approach. In contrast to the Dennison et al. study, which is grounded in usage
log analysis to provide more reliable usage time estimates, the Selwyn et al. study highlights the need to
account for the hidden labour of teachers that may not be captured by usage logs. Yet, given that Dennison et al. compared GenAI-supported lesson planning with a non-GenAI baseline, some of this hidden labour may
already have been reflected in their analysis. This highlights the importance of fair and contextually comparable
evaluation frameworks that consider how GenAI tools are implemented and how teacher time use is measured
across studies. Given the essential role teachers play in education, this underscores the importance of exploring
design principles, organisational adoption strategies, and the broader implications of adopting GenAI technologies
for teaching support.
One of the most pressing areas of application for GenAI in education is the provision of automated feedback.
Feedback represents a persistent challenge in higher education, where increasing student numbers are not matched
by proportional increases in teaching resources. It is also a challenge at school level in contexts where
student/teacher ratios are high, or when teachers teach a subject with few curriculum hours (and thus many classes
and students). This structural tension has made it difficult to offer timely, targeted, and individualised feedback
at scale that follows principles of effective and learner-centred feedback. Feedback can improve learning progression and support the development of relationships between students and educators. As shown in the remainder of this section, GenAI holds strong promise
for enabling the rapid and scalable generation of feedback across multiple modalities, with the potential to enhance
feedback scalability, quality, and even feedback literacy.
GenAI has been found to be promising to offer feedback on students’ written products in higher education. In
a recent study, Dai et al. compared feedback generated by large language models to that provided by
human tutors. Their study compared feedback on readability, the similarity of positive and negative points identified,
and the levels at which feedback was provided. The levels of feedback were grounded in Hattie and Timperley’s seminal framework that distinguishes feedback on task (correctness), process (learning strategies), self-regulation
(monitoring learning), and self (personal traits and motivation). In this framework, higher-level feedback, particularly
at the process and self-regulation levels, is widely recognised as more educationally valuable and a key indicator of
feedback quality because it supports deeper learning and learner autonomy. The Dai et al. findings revealed that
GenAI (i.e. GPT-3.5 and GPT-4) tended to produce more readable and stylistically polished feedback than human
educators, with quite a large effect size (d = 1.79). This finding was somewhat unsurprising,
given that human assessors often operate under strict time constraints and offer rather succinct feedback. However,
the study also revealed limited alignment between the strengths and weaknesses that GenAI-produced feedback and
human tutors identified in student work according to a rubric. Dai et al. also showed that GPT (particularly
GPT-4) models were able to produce feedback that offered guidance about future choice of learning strategies (i.e.
process level feedback) in over 97% of feedback instances. Interestingly, this was higher than what was observed in
human-provided feedback which was on process level in about 80% of feedback instances. However, GPT-4 was much
less able to produce feedback at the self-regulation level, which appeared in only 17% of cases. However, even that was
higher than human tutors, who offered self-regulation level feedback in only 11% of cases. This
highlights the challenge of providing feedback at the self-regulation level, where learners are guided to monitor their
own learning. This challenge is particularly important in the age of GenAI as discussed later.
GenAI can also be used to generate feedback guiding students based on the insights of predictive modelling.
Early prediction of students at risk of failing or dropping out has been at the core of research and practice in
learning analytics for a long time. However,
translating insights from predictive modelling to actionable feedback has received much less attention. A notable
example with much success in improving student learning and experience was the OnTask system that allowed
educators to manually write rules to generate personalised feedback based on student data.
Although much more efficient than manual feedback writing at scale, it still could not translate granular insights
of predictive modelling to actionable feedback (e.g. advice on which practice exercises to take). To address this
challenge, Liang et al. proposed an approach for transforming insights from predictive modelling to
personalised feedback with the use of GPT-4, which was rated by experienced educators as “readily applicable to
the course” and higher on readability, relational characteristics, and specificity than
human-provided feedback. However, future research is needed to assess the effectiveness of such
personalised feedback on learning outcomes, student retention, and the extent to which students actually use and
act upon the AI-generated feedback.
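The general idea of turning predictive-model outputs into a feedback-generation prompt can be sketched as follows. This is a hypothetical illustration of the approach described above, not the Liang et al. implementation: the field names, the risk value, and the listed drivers are all invented for the example.

```python
def insights_to_prompt(risk: float, drivers: dict, course: str) -> str:
    """Render predictive-model insights as a prompt for an LLM feedback draft.

    `drivers` maps a predictive feature to a human-readable observation;
    all names here are illustrative assumptions.
    """
    lines = [f"- {feature}: {value}" for feature, value in drivers.items()]
    return (
        f"You are writing feedback for a student in {course}.\n"
        f"Predicted risk of failing: {risk:.0%}.\n"
        "Top factors behind the prediction:\n"
        + "\n".join(lines)
        + "\nWrite short, encouraging, actionable advice (e.g. which "
        "practice exercises to attempt) addressing each factor."
    )

prompt = insights_to_prompt(
    0.72,
    {"quiz_completion": "3 of 10 quizzes attempted",
     "forum_activity": "no posts in 4 weeks"},
    "Introductory Statistics",
)
```

The point of such a pipeline is that the numeric model supplies *what* to address while the LLM supplies *how* to say it, which is why educator review of the drafted feedback remains important.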
GenAI can also be used to check the quality of feedback to promote best practices at scale. Previous research
demonstrates the potential of the use of conventional machine learning to recognise whether human produced
feedback followed established models for feedback. For example, Osakwe et al. used
an XGBoost machine learning model trained on established linguistic features (e.g. cohesion or use of cognitive
words) to identify self, task, and process levels of feedback with accuracy values of 0.87, 0.82, and 0.69, respectively.
In a recent study, Aldino et al. evaluated the performance of GPT-3.5 with zero-shot prompts to identify
elements of learner-centred feedback on a large dataset of feedback messages (>16k) in higher education.
GPT-3.5 showed some promising results, with accuracy in the range of 0.53–0.97 across the seven attributes of learner-centred feedback. However, GPT-3.5 was consistently outperformed by conventional machine learning models (i.e.
XGBoost and Random Forest) based on linguistic features (e.g. cohesion and word count), while BERT almost always
performed reliably (accuracy 0.91-0.99) (see Box 2.1 for definition of BERT). Higher accuracy of traditional machine
learning over ChatGPT was also shown in evaluation of the quality of peer feedback. Similarly,
Dai et al. showed that GPT-4o was able to identify nine out of 10 relational characteristics of feedback with
an average accuracy exceeding 80%. For example, the model successfully recognised feedback that acknowledged
students’ strengths, offered balanced critical comments, and included actionable suggestions for improvement. Yet,
they found no significant improvement from few-shot prompting strategies over zero-shot prompting. These findings
suggest that while GPT prompting approaches offer a promising and accessible entry point due to their lower technical
barrier, achieving consistently high accuracy still requires conventional machine learning methods and language models
like BERT.
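To make the feature-based classification pipeline concrete, here is a deliberately simplified sketch: cue-word features drive a rule-based label assignment for Hattie and Timperley's feedback levels. The cue lists and rules are illustrative assumptions and a toy stand-in for the richer linguistic features and XGBoost models used in the studies above.

```python
# Toy cue-word lists standing in for established linguistic features.
PROCESS_CUES = {"strategy", "approach", "method", "revise", "structure"}
SELF_REG_CUES = {"monitor", "reflect", "plan", "goal", "self-assess"}

def extract_features(feedback: str) -> dict:
    """Turn a feedback message into a small numeric feature vector."""
    words = [w.strip(".,!?") for w in feedback.lower().split()]
    return {
        "n_words": len(words),
        "process_hits": sum(w in PROCESS_CUES for w in words),
        "selfreg_hits": sum(w in SELF_REG_CUES for w in words),
    }

def classify_level(feedback: str) -> str:
    """Rule-based stand-in for a trained classifier over these features."""
    f = extract_features(feedback)
    if f["selfreg_hits"] > 0:
        return "self-regulation"
    if f["process_hits"] > 0:
        return "process"
    return "task"
```

In the published work, the same features-to-label mapping is learned from annotated data rather than hand-written, which is what allows accuracies in the ranges reported above.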
The differences in performance between GenAI and human educators create new opportunities to complement
and enhance the effectiveness of human tutors. For example, GenAI can provide positive and negative points
in feedback, which human educators can use as suggestions to enhance their own feedback drafts. This is also
suggested by Lu et al. who argue that GenAI can offer immediate and personalised feedback on lower-order concerns in written products such as grammar, vocabulary, and sentence structure. Their premise is that
this may allow teachers to focus on higher-order thinking skills, content depth, and argumentation, where human
judgment remains crucial. The results of the Dai et al. study indicate that GenAI can help enhance human
feedback with effective feedback practices. This hybrid approach holds potential to
enhance efficiency without compromising pedagogical judgment and future research and practice should evaluate
its effectiveness. Despite all these promises, specialised tools that promote this hybrid approach for educators
are in their early days. For example, Feedback Copilot was developed using principles of co-design to create
effective user interfaces that incorporate the use of GenAI. The efficacy of Feedback Copilot
is yet to be evaluated in practice, which highlights an important research gap and direction for future research.
A growing body of research has explored how students perceive and respond to feedback generated by GenAI, particularly in comparison to human feedback. Studies have shown that students tend to act more readily on feedback from human instructors than from GenAI tools. Students often found GenAI feedback specific and clear, however, especially in technical tasks. Nevertheless, several studies also highlight concerns regarding the perceived usefulness and trustworthiness of GenAI-generated feedback. Although these studies differ in focus and methodological design, ranging from quasi-experimental evaluations of learning outcomes and randomised controlled comparisons of instructor and AI feedback to large-scale perception studies in higher education contexts, they consistently point to lower perceived usefulness and trust in AI-generated feedback relative to human feedback.
Overall, current evidence suggests that while GenAI feedback can match human feedback in measurable learning outcomes, it does not replicate its pedagogical value or social credibility. For example, Escalante et al. found no significant difference in learning outcomes between students receiving feedback from GPT-4 and those receiving tutor feedback, although participants were evenly split in their perceptions of usefulness. While this might appear to suggest functional equivalence, comparable performance does not imply pedagogical interchangeability. As shown in the recent meta-analysis by Kaliisa et al. across 41 studies, AI-generated and human feedback yield statistically similar learning gains, yet students perceived human feedback as more credible and meaningful. This distinction points to the broader role of feedback in shaping motivation, evaluative judgment, and learner trust, dimensions that remain difficult for GenAI systems to reproduce even when outcome measures are equivalent. Similarly, Er et al.
reported that human feedback was perceived as significantly more useful, and students who received it showed greater improvement in lab scores in Java programming. In a related study, Nazaretsky et al. found that students’ perceptions of feedback varied depending on the provider’s identity. When the feedback source was unknown, students rated AI feedback more favourably; however, when the source was revealed, they placed greater trust in human feedback. Although highly relevant to trust, the effects of hallucinations in GenAI on feedback uptake have received little attention in the literature and warrant future attention. Perceptions of fairness have also been somewhat contradictory: while some studies found that GenAI feedback was rated as fair by students, other studies observed the opposite.
GenAI feedback has also shown potential to support important metacognitive processes. For instance, Tang et al. demonstrated that structured GenAI feedback on writing tasks significantly improved students’ self-assessment accuracy, which is a key skill for independent learning. However, this potential is not always realised. Jin et al. found that students with low feedback literacy engaged only minimally with a GenAI-based support tool, often due to a mismatch between the tool’s responses and their expectations. These findings suggest that the impact of GenAI feedback depends not only on its technical qualities but also on learners’ readiness to interpret and apply it effectively. As Zhan and Yan argue, fostering feedback engagement in a GenAI context requires the explicit development of students’ feedback literacy, including skills in prompt engineering, evaluative judgment, and metacognition, to facilitate deeper and more meaningful interaction with GenAI in feedback practices.
Future research should aim to (a) investigate the extent to which feedback literacy of students can be promoted to more effectively and critically engage with AI-generated feedback and (b) understand whether feedback literacy enables learners to improve their learning outcomes when using AI-generated feedback.
GenAI can support generation of feedback in different modalities that goes beyond textual feedback. For example,
learning analytics offers dashboards as an alternative and cost-effective approach to provide feedback based on
analysis of student data. However, learning analytic dashboards have not achieved their full
potential. One of the main reasons for this is the relatively limited data visualisation literacy of
educators and students, which constrains their ability to understand and translate insights from statistics and charts into action. To address these limitations in visualisation literacy, GenAI can offer
two complementary approaches.
First, GenAI can provide a layer guiding educators and learners to improve their abilities to comprehend dashboards
accurately. For example, Yan et al. developed a tool called VizChat, which allows students and educators
to interact with a chatbot to help them understand the data shown in the dashboard by asking questions (Figure
2.1). When configured in a proactive mode (i.e. using scaffolding questions), VizChat significantly enhanced the
comprehension of learning analytic dashboards compared to both a passive chat mode (i.e. responding to student
queries a la ChatGPT) and standalone scaffolding. Importantly, these benefits persisted
even when the students did not have access to proactive VizChat. Building on these promising results, future research
should investigate the extent to which learners and educators can transform insights obtained from learning analytic
dashboards into effective learning and teaching practice thanks to GenAI.
Second, GenAI can be used to generate feedback in other forms than text, for example in the form of data comics.
Data comics follow established principles of comic strip genres (e.g. Manga) and are generated by prompting multimodal language models to generate images based on analytic insights. Data comics
were applied in simulation-based learning for healthcare professionals, where student nurses engage in highly
collaborative learning scenarios in physical spaces. Data comics (see
Figure 2.2) aim to present feedback in a more accessible, emotionally engaging format. Qualitative evidence suggests
that data comics can improve student motivation and reflective engagement; some students even reported feeling
seen or valued. However, some students in higher education found this approach not
sufficiently professional and potentially perpetuating biases (e.g. all nurses generated in data comics were women
and the doctor was a man). Building on the positive aspects of GenAI-powered data comics, future research is needed to
understand their effectiveness across different educational contexts and levels, while minimising potential negative
effects. The same idea could also be applied to the AI generation of video clips based on multimodal learning
analytics.
Feedback on learning processes (e.g. goal setting, strategy use, and self-monitoring) is underrepresented in existing
literature on AI in education. While learning is never fully transparent, learning
analytics has made substantial progress in visualising and interpreting the otherwise invisible dynamics of the learning
process, which can provide educators and learners with actionable insights into how learning unfolds. With advances
in learning analytics, we can now analyse fine-grained trace data such as clickstreams, mouse movements, and other
digital traces of student activity to identify cognitive, metacognitive, affective, and motivational processes. Research in learning analytics has also shown that such approaches can offer
nuanced insights into the learning strategies that learners used. Moreover, existing research has
shown that learning processes can explain more variance in student essay scores than linguistic essay properties
(e.g. text cohesion) that are commonly used in automated essay scoring. However, translating
insights from the underlying representations of data analytic models – e.g. process maps, networks, or descriptive
statistics – requires considerable data literacy, which can be a barrier for many educators and learners.
GenAI holds strong potential to support feedback practices on learning processes due to its ability to
combine insights from data analytic models about learning processes with instructional information and
subject matter content. By combining all these perspectives, GenAI can produce contextually relevant and
personalised learning support – e.g. feedback or scaffolds – that aim to guide learners to improve their learning
processes and performance. As outlined in
Box 2.2, LLMs can be prompted with insights from real-time analytics of self-regulated learning processes, along with
information about principles for effective feedback and relevant content information,
to generate personalised scaffolds.
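The prompting approach just described can be sketched as a simple template that merges analytics insights, feedback principles, and subject-matter context into a single instruction for an LLM. This is a minimal illustration; the function and field names are hypothetical and not those of the system in Box 2.2.

```python
def build_scaffold_prompt(srl_insights, feedback_principles, content_summary):
    """Assemble an LLM prompt that combines real-time analytics of
    self-regulated learning (SRL) with feedback principles and
    subject-matter context to request a personalised scaffold."""
    insight_lines = "\n".join(f"- {i}" for i in srl_insights)
    principle_lines = "\n".join(f"- {p}" for p in feedback_principles)
    return (
        "You are a tutoring assistant. Based on the learner's observed "
        "self-regulated learning processes, generate one short, "
        "personalised scaffold (a question or hint, not a solution).\n\n"
        f"Observed SRL processes:\n{insight_lines}\n\n"
        f"Feedback principles to follow:\n{principle_lines}\n\n"
        f"Current task content:\n{content_summary}\n"
    )

prompt = build_scaffold_prompt(
    srl_insights=["spent 80% of session time reading, little planning",
                  "no self-monitoring events in the last 10 minutes"],
    feedback_principles=["be specific and actionable",
                         "prompt reflection rather than give answers"],
    content_summary="Essay task on the causes of climate change.",
)
```

The key design point is that the analytics insight, the pedagogical principle, and the content each occupy a distinct slot, so the same template can serve different learners and tasks.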
The potential of process feedback has profound implications in the age of GenAI. As students can now easily use
GenAI tools to produce polished final products, it becomes increasingly important to assess how students engage
with the learning process, rather than focusing solely on the end result. Moreover, process feedback can highlight
critical challenges learners may face when using GenAI (e.g. metacognitive laziness and overreliance).
The transformative potential of process assessment is further discussed below.
GenAI holds the promise to generate assessment items. Although GenAI can produce a wide range of content, its
use in standardised assessment requires generating items that meet psychometric standards of validity and reliability. Emerging evidence suggests this is feasible. For example,
Bhandari et al. showed that ChatGPT can generate psychometrically sound items for algebra, while Attali
et al. demonstrated similar success for reading tasks. The work by Attali et al. underpins the
automated item generation process used in the Duolingo English Test, a widely recognised language proficiency
exam. GenAI can also be used to evaluate the quality of assessment items. Work at Duolingo emphasised the importance
of the human-in-the-loop for item quality review and sensitivity review as part of quality assurance,
before checking the psychometric properties of the generated items. This is also aligned with
the recommendations by Moore et al. to combine human judgement with LLMs to produce high-quality
multiple choice questions and short answer questions.
There is growing evidence of the potential of GenAI in existing assessment practices. Existing research
shows that the use of GenAI can be particularly effective when fine-tuned LLMs are used for automatic scoring of
open-ended responses, demonstrating accuracy comparable or superior to models based on conventional machine
and deep learning approaches. Latif and Zhai, for instance, showed that a fine-tuned version of GPT-3.5
significantly outperformed BERT in scoring multi-label and multi-class science education tasks, achieving up to a
10.6% accuracy improvement. Similarly, GPT-4 has shown strong alignment (Quadratic Weighted Kappa (QWK) over
0.8) with contemporary writing evaluation tools in high-stakes language assessment contexts for L2 English learners,
especially when provided with a single calibration example for each rating category. However,
Mansour et al. showed that conventional approaches substantially outperformed ChatGPT-3.5 Turbo and
Llama2 (average QWK of 0.817 vs 0.313 and 0.201) on automatic essay scoring of English essays from the Automated
Student Assessment Prize dataset, which contains essays written in English by U.S. secondary school students in
grades 7–10 for whom English is the first language on persuasive, source-dependent, and narrative writing tasks.
The results indicate that although LLMs can potentially be useful for some types of automated scoring tasks, they
may not be for others. It is therefore important to extend the existing body of knowledge to understand the types of
tasks LLMs can be effective for to inform educational practice and policy. Likewise, educators need to be careful in
their choices of relying on GenAI for automatic scoring.
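Quadratic Weighted Kappa (QWK), the agreement statistic reported in the studies above, can be computed directly from two lists of integer ratings. The sketch below implements the standard formula and is not the evaluation code of any cited study.

```python
from collections import Counter

def quadratic_weighted_kappa(rater_a, rater_b, min_rating=None, max_rating=None):
    """Quadratic weighted kappa between two lists of integer ratings
    (e.g. human scores vs automated scores on the same essays)."""
    assert len(rater_a) == len(rater_b)
    if min_rating is None:
        min_rating = min(min(rater_a), min(rater_b))
    if max_rating is None:
        max_rating = max(max(rater_a), max(rater_b))
    num_ratings = max_rating - min_rating + 1
    n = len(rater_a)

    # Observed agreement matrix: counts of (score_a, score_b) pairs.
    observed = [[0.0] * num_ratings for _ in range(num_ratings)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1

    # Expected counts under independence, from each rater's marginals.
    hist_a = Counter(a - min_rating for a in rater_a)
    hist_b = Counter(b - min_rating for b in rater_b)

    numerator = 0.0
    denominator = 0.0
    for i in range(num_ratings):
        for j in range(num_ratings):
            # Disagreement penalty grows quadratically with distance.
            weight = ((i - j) ** 2) / ((num_ratings - 1) ** 2)
            expected = hist_a[i] * hist_b[j] / n
            numerator += weight * observed[i][j]
            denominator += weight * expected
    return 1.0 - numerator / denominator
```

QWK is 1 for perfect agreement, 0 for chance-level agreement, and negative when raters systematically disagree, which is why values above 0.8 are treated as strong alignment in the studies cited here.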
Several studies have examined the extent to which GenAI can automatically assess responses to open-ended questions
in standardised assessments and identify effective prompting strategies. For example, Rodrigues et al. evaluated GPT-4 across 738 open-ended questions drawn from high school Biology, Earth Science, and Physics tasks
categorized by Bloom’s taxonomy. The model produced high-quality responses overall, though its
performance declined on questions requiring factual recall or creative reasoning. Chan et al. analysed LLMs
in standardized STEM assessments and showed that chain-of-thought prompting significantly improved accuracy,
particularly for reasoning-intensive problems. In higher education, Moore et al. explored GPT-3’s ability
to evaluate student-generated short-answer chemistry questions in online college courses and found only modest
alignment (32-40%) with expert judgments. Together, these studies show that while GenAI can complement human
grading in structured educational contexts, its reliability still varies by domain, cognitive demand, and prompt design,
highlighting the continued need for human oversight in both item generation and scoring.
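Chain-of-thought prompting of the kind examined by Chan et al. can be illustrated with a grading template that asks the model to reason before scoring. The wording and rubric below are hypothetical, intended only to contrast step-by-step reasoning with direct score elicitation.

```python
def build_cot_grading_prompt(question, student_answer, rubric):
    """Ask the model to reason step by step before committing to a score,
    rather than emitting a score directly."""
    rubric_lines = "\n".join(f"{score}: {desc}" for score, desc in rubric.items())
    return (
        f"Question:\n{question}\n\n"
        f"Student answer:\n{student_answer}\n\n"
        f"Rubric:\n{rubric_lines}\n\n"
        "First, reason step by step: identify the key ideas the rubric "
        "rewards, check which of them the answer contains, and note any "
        "errors. Only then output a line of the form 'Score: <n>'."
    )

prompt = build_cot_grading_prompt(
    question="Why does ice float on water?",
    student_answer="Because ice is less dense than liquid water.",
    rubric={2: "full explanation", 1: "partially correct", 0: "incorrect"},
)
```

Constraining the final line to a fixed format ('Score: <n>') also makes the model's output machine-parseable, which matters when scoring at scale.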
Despite the promise of generative AI, its integration into education raises fundamental questions that demand critical scrutiny. As the capacity of AI systems to automate cognitive tasks increases, it becomes imperative to interrogate not only what these technologies can accomplish but also what might be lost in the process. This section examines how prevailing assumptions about skill development and assessment are being disrupted and suggests a reorientation of educational priorities for the era of generative AI.
A central contention of this chapter is that educational systems must intentionally foster human capabilities even as they
leverage GenAI’s transformative potential. This imperative is not merely pedagogical; it is foundational to the cultivation
of human skills that will enable individuals to thrive in rapidly evolving digital environments. According to Yan, Greiff, et
al., it is important to distinguish between two interrelated dimensions in human learning when using GenAI:
AI-empowered performance and human skill development (Figure 2.4). The first dimension (vertical in Figure 2.4)
focuses on the development of human skills, i.e. human learning. This dimension has traditionally been the focus of
education, including research on AI in education; applications of GenAI to support human learning
were covered in the previous section on "Existing Practices". However, the ubiquitous presence of GenAI changes the
context in which learning happens. This is why we also consider the second dimension (horizontal in Figure 2.4), which
concerns the extent to which individuals use AI tools, such as large language models, to enhance task execution and
produce high-quality outputs. In the remainder of this section, we consider the implications of these two dimensions
on human skill development according to evidence emerging from the existing literature.
The intersection of these dimensions defines the horizon toward which education should strive: learners who combine
strong independent skills with the effective, reflective use of AI augmentation. However, a growing body of evidence
suggests that this aspiration is not easily achieved and that the introduction of generative AI can create a "mirage
of false mastery," where high-quality, AI-enabled output conceals underlying weaknesses in human skill – i.e. the
undesirable curve in Figure 2.4 where task performance does not correlate with learning.
While generative AI has shown promise in supporting various educational tasks, its effectiveness in fostering long-term skill development remains uncertain. An important study in this space was conducted by Darvishi and colleagues, who investigated the extent to which an AI support tool could improve students’ ability to provide effective
peer feedback. The GenAI tool was designed to support students in generating feedback more effectively, rather than
directly improving the content of their responses. In a large-scale randomised controlled trial with approximately
1 600 students, Darvishi et al. observed that while initial AI-supported gains in peer feedback quality were
significant, these gains were not sustained once the tool was withdrawn. Students did not retain the feedback skills that
appeared to have been acquired with AI support. Moreover, there was no robust evidence of synergistic development
of human and AI-empowered skills; students generally exhibited strength in either AI-assisted performance or
independent skill, but rarely both. These findings are echoed by a systematic review and meta-analysis by Vaccaro et
al., which analysed 106 experimental studies of human-AI collaboration. The meta-analysis found that, on average, human-AI combinations performed worse than the best of either humans alone or AI alone, especially in
decision-making tasks. This cautions against the assumption that human-AI synergy will naturally emerge. The risk in
many educational contexts is that generative AI either simply augments current abilities or, more problematically in
an educational context, substitutes for human effort without fostering genuine skill development.
This pattern of substitution is often driven by external pressures. Research by Abbas et al. revealed that
university students were more likely to use ChatGPT when facing a high academic workload and time pressure.
Their study, involving nearly 500 students, found that this utility came at a cost: increased use of ChatGPT was
correlated with higher levels of procrastination, self-reported memory loss, and ultimately, diminished academic
performance. Such findings suggest that students may turn to GenAI not as a partner for learning but as a tool to
manage overwhelming demands, leading to unintended negative consequences. Furthermore, this substitution can
foster an uncritical over-reliance on AI. In a systematic review, Zhai et al. investigated how over-reliance
on AI dialogue systems affects students’ cognitive abilities. They define over-reliance as the uncritical acceptance of
AI-generated recommendations, a tendency that arises when individuals struggle to assess the trustworthiness of
the tool. Their findings indicate that this behaviour encourages the use of cognitive shortcuts, favouring fast, efficient
answers over slow, effortful reasoning. This preference undermines the development of essential cognitive abilities,
including decision-making, analytical reasoning, and critical thinking. It is crucial, therefore, to resist the temptation
to conflate AI-augmented performance with authentic competence or deep learning.
Another dimension of concern relates to the effects of the use of GenAI on learning processes. There
is accumulating evidence that increased reliance on GenAI tools can suppress students’ engagement in
self-monitoring (defined as the ongoing process of checking, regulating, and adjusting one’s understanding and
strategies during learning), reflection, and evaluative judgement of one’s learning processes, processes that are
fundamental to autonomous learning. When GenAI is used as a shortcut rather than as a scaffold
that promotes learning, students may defer cognitive effort to technology, thereby weakening the very skills that
underlie deep learning.
Empirical research has begun to quantify the risks GenAI poses to human cognition and metacognition. In a study
comparing the use of ChatGPT to traditional search engines for a scientific inquiry task, Stadler et al. found that students using the large language model experienced a significantly lower cognitive load. However,
this cognitive ease came at a cost: these students produced lower-quality reasoning and argumentation in their
final recommendations compared to the group using the Google search engine. This highlights a critical trade-off, suggesting that while LLMs can reduce the cognitive burden of information gathering, they may not promote
the deeper cognitive engagement necessary for high-quality learning. This finding is reinforced by a randomised
experimental study by Fan et al., which compared university students’ writing processes when supported
by ChatGPT, a human expert, a writing analytics tool, or no additional support. While the ChatGPT-supported group
showed greater improvements in essay scores, these gains did not translate into deeper knowledge acquisition
or transfer (as measured by a knowledge transfer test on different topics). More importantly, the study found that
learners in the AI-supported group demonstrated a marked reliance on the technology and were less likely to engage
in metacognitive activities such as self-monitoring and reflection, a phenomenon the authors term metacognitive
laziness.
The impact of AI on self-directed learning is further complicated by students’ motivations for using these tools. A year-long longitudinal study by Xie et al. examined how interaction frequency with chatbots affected learning
autonomy. The results were nuanced: for learners seeking virtual companionship, the social presence fostered by the
AI had a positive mediating effect on their learning autonomy. Conversely, for learners focused purely on knowledge
acquisition, more frequent interaction with the chatbot was negatively correlated with both social presence and
learning autonomy. This indicates that the effect of AI interaction is not uniform and that frequent use for instrumental
purposes may undermine the development of independent learning habits.
These findings illustrate a crucial distinction: apparent improvements in performance enabled by generative AI may
mask deficits in learners’ underlying cognitive and metacognitive processes. However, this does not mean AI cannot
play a productive role in learning. When structured intentionally within a collaborative learning environment, AI can
act as a powerful scaffold. For instance, An et al. studied student teachers using a mind-mapping tool
integrated with GenAI. The groups using the AI tool not only outperformed the control groups on their collaborative
tasks but also demonstrated a more sophisticated knowledge construction process, moving progressively from
individual ideas to peer interaction and group synthesis.
As students increasingly use GenAI tools for learning, traditional assessment models that focus solely on final outputs
are becoming inadequate. When high-quality products can be produced with minimal engagement in the learning
process, assessment risks measuring technological proficiency rather than human skill or understanding. To address
this challenge, there is a pressing need to reorient assessment practices towards process-oriented approaches that
evaluate not just what students produce, but how they engage with learning to create products. Assessments should
aim to capture the processes students use to plan, monitor, and adapt their work, thereby revealing the authenticity
and depth of their learning in GenAI-rich environments. Only by prioritising cognitive and metacognitive engagement
alongside product quality can educational systems ensure that AI augments, rather than supplants, the development
of meaningful human expertise.
One promising way to operationalise this shift is through evidence-centred assessment design (ECD). The ECD framework provides a principled model for linking assessment tasks, evidence, and inferences
about learners’ knowledge and skills. By moving beyond a narrow focus on final outputs, ECD enables the design of
multidimensional assessments that capture both product and process evidence.
An illustrative example of this process-oriented approach comes from recent work in medical education, where clinical
reasoning tasks have been redesigned to capture a more holistic view of learning. Drawing on
the ECD framework, this approach moves beyond assessing only the final diagnostic conclusion. Instead, it builds a
multidimensional evidence model by collecting three streams of data as students interact with GenAI-powered virtual
patients: product evidence (e.g. diagnostic accuracy), process evidence (e.g. conversation logs where students do
history taking), and metacognitive evidence (e.g. clickstream data and interaction logs). Analysis of this rich data reveals
that integrating all three evidence sources provides a significantly more reliable prediction of learner performance
than relying on product-based measures alone. Notably, process data emerged as the strongest standalone predictor,
underscoring the value of assessing the “how” of learning, not just the “what”.
Building on this ECD foundation, predictive models are now being paired with Explainable AI (XAI) to make the
assessment process not only accurate but pedagogically meaningful. Simply predicting performance with a “black-box” machine learning model is insufficient for supporting learning. To make insights actionable, the XAI layer
identifies the key factors influencing a prediction. These technical explanations are then
translated by a GenAI system into structured, personalised, and pedagogically relevant feedback for the learner.
This hybrid XAI–GenAI approach ensures that feedback is aligned with self-regulated learning principles, helping
students understand not only their performance but also the cognitive and metacognitive strategies that shaped
it. By grounding feedback in specific evidence from the learning process, this approach extends the ECD model
beyond assessment design to feedback delivery, providing transparent, actionable guidance that fosters genuine
skill development.
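The combination of product, process, and metacognitive evidence described above can be sketched as a weighted composite in which the weakest evidence stream becomes the focus of feedback. The weights below are illustrative assumptions, not the fitted parameters of the FLoRA evidence model.

```python
def combine_evidence(product, process, metacognitive,
                     weights=(0.3, 0.45, 0.25)):
    """Combine three evidence streams (each normalised to 0..1) into a
    single performance estimate. Process evidence is given the largest
    illustrative weight, echoing the finding that it was the strongest
    standalone predictor. Returns the composite score and the label of
    the weakest stream, which feedback generation would target."""
    scores = (product, process, metacognitive)
    composite = sum(w * s for w, s in zip(weights, scores))
    labels = ("product", "process", "metacognitive")
    weakest = min(zip(labels, scores), key=lambda x: x[1])[0]
    return composite, weakest

score, focus = combine_evidence(product=0.9, process=0.4, metacognitive=0.7)
# focus == "process": feedback would target how the learner worked,
# e.g. the structure of their history-taking conversation.
```

In a full pipeline, the `weakest` label would be passed, together with the underlying evidence, into a GenAI prompt that drafts the personalised feedback shown in Figure 2.5 C.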
Figure 2.5 demonstrates how this assessment approach is implemented in the FLoRA platform for history-taking skills
as part of the development of clinical reasoning in medical education. Learners first interact with
the virtual standardised patients (Figure 2.5 A), which are also based on GPT. Once the learner completes the interaction
with the virtual standardised patient and submits their diagnosis (Figure 2.5 B), the system applies the evidence model
and generates personalised feedback (Figure 2.5 C).
The methodological rigour of research on generative AI in education is critical to produce quality evidence. If we are
to make sound, evidence-informed decisions, it is essential to move beyond the commentaries and hype cycle and
uphold high standards of empirical inquiry. As also indicated in the previous subsection and Figure 2.4, a central
challenge in producing robust empirical evidence about the effects of GenAI on human skills is the pervasive conflation
of performance with learning. Performance refers to the observable execution of a task, whereas
learning involves an enduring change in knowledge and skills that is demonstrated through retention and transfer. The distinction is essential; high performance, especially when mediated by a
powerful tool, does not imply that learning has occurred.
A second, related but distinct issue is the media/methods fallacy. For decades, researchers have
cautioned against simplistic “media comparison studies” that attribute learning gains to a technology itself, rather
than to the specific instructional methods it enables. Much of the nascent research on generative AI repeats this
error, comparing an ill-defined “ChatGPT condition” with a control group and concluding that the technology “works”.
Such designs may demonstrate that a particular arrangement (e.g. students working with ChatGPT) can yield different
outcomes than another (e.g. students working alone). However, because they attribute effects to the technology as
a whole rather than to the specific instructional processes it affords, these studies provide limited insight into the
underlying mechanisms. This limits their explanatory power and risks conflating performance support with genuine
learning.
A further methodological weakness, distinct from but often co-occurring with the media/methods fallacy, is the
conflation of task performance with learning. For instance, meta-analyses claiming that ChatGPT enhances “academic
performance” often measure immediate task achievement rather than durable learning. While students may produce a better essay or translation with AI assistance, this performance
gain may mask a lack of underlying cognitive engagement and learning gains. As discussed earlier, offloading effort
to AI can reduce cognitive load but also risks fostering “metacognitive laziness”, thereby undermining the very
processes required for deep skill development. This problem
is amplified by the “fast science” culture, where sensational claims, such as GPT-4 “acing” the MIT curriculum, gain traction despite significant methodological flaws, including data contamination and a lack of
transparent verification. Even if such claims were accurate, they would have limited
educational meaning, as strong GenAI performance on standardised or benchmark tasks does not entail conceptual
understanding and does not imply that the underlying processes of human learning and transfer should be
abandoned. The danger lies in conflating technical proficiency with educational value, which can distort expectations
and policy directions and fuel the kind of policy-practice misalignment that has characterised many AI-in-education
debates.
To build a robust evidence base, the field must adopt a more rigorous research agenda. First, researchers must
explicitly differentiate learning from performance by incorporating process-oriented assessments, such as delayed
retention and knowledge transfer tests, into their designs. Second, studies must move beyond
media comparisons to isolate causal mechanisms, clearly defining the pedagogical function of the AI intervention,
much like the decades of theory-driven research on Intelligent Tutoring Systems. Finally, we must
prioritise longitudinal research that tracks the durable effects of AI interaction on students’ knowledge, skills, and
dispositions over time.
For policymakers and funding organisations, this highlights a critical need to guide future investment. To build a robust
evidence base, funding should prioritise longitudinal studies that track durable skills, demand that interventions
clearly specify their pedagogical underpinnings, and support the development of process-oriented assessments. Only
by investing in research that distinguishes task performance from learning can we ensure that technology serves our
ultimate goal: fostering deep and lasting human competence.
In conclusion, our findings underscore the significant promise of GenAI in enhancing educational practices related to
learning and assessment. Specifically, we have demonstrated that GenAI-powered systems can directly support both
students and educators, streamlining teaching activities and providing targeted assistance. However, despite these
promising developments, our analysis also reveals several critical caveats that must be carefully considered when
informing future practice and policy in this area. One such concern is the need for careful attention to the design
of teaching practices that enable the effective use of GenAI, particularly in systems designed to directly support
students, such as tutoring chatbots. For example, recent studies have shown that combining GenAI with established
instructional methods (e.g. scaffolding), where GenAI agents guide students through step-by-step reasoning, can
foster genuine learning and sustained performance enhancement even after removal of GenAI support. By contrast, unguided “answer-giving” practices, where students simply request solutions from a chatbot,
have been found to undermine reflection and suppress metacognitive engagement. As emerging evidence suggests, not all students may benefit equally from these systems.
Therefore, it is essential to consider how different student subpopulations, based on factors such as socio-economic
status or prior academic achievement, interact with these technologies. By identifying the specific conditions under
which these subgroups may benefit from GenAI-powered systems, we can mitigate potential inequalities and ensure
that AI tools support a diverse range of learners.
Moreover, while GenAI offers the potential for rapid development of tools like tutoring chatbots, one should recognise
that on average general-purpose LLMs (e.g. off-the-shelf GPT systems) do not yet match the effectiveness of traditional
intelligent tutoring systems when they are not designed or fine-tuned with adequate pedagogical knowledge. Hybrid systems that embed GenAI within educationally grounded frameworks may show more
promise, but the evidence base remains limited. We also still lack evidence on how GenAI-powered
tutors compare with their conventional counterparts in their ability to provide sustained, long-term learning
support. As a result, future research needs to investigate whether GenAI-based tutoring systems can effectively
support learners over extended periods. In addition, there is considerable potential for GenAI to complement existing
intelligent tutoring systems by enhancing their interactivity and enabling more natural language communication,
which could ultimately create more personalised learning experiences. Future research should focus on exploring
how GenAI can be integrated into intelligent tutoring systems, drawing on well-established educational principles to
enhance these systems’ overall efficacy.
Despite the promising capabilities of GenAI to generate high-quality feedback, research has shown that students’
trust in AI-generated feedback varies considerably across contexts. In some studies, students respond positively and
perceive such feedback as clear and useful, while in others they express scepticism about its accuracy or relevance.
This variability in trust can influence whether learners engage with GenAI feedback, which in turn affects its potential
impact on learning. To fully realise the benefits of GenAI in supporting feedback processes, future work should focus
on developing teaching practices that help integrate AI-generated feedback effectively into classroom use. One
promising direction is to use GenAI as a tool to help educators reflect on and refine their feedback (i.e. human or
AI-generated) by checking whether it is clear, balanced, and aligned with established feedback principles before it
reaches students.
GenAI shows promise in supporting educators with their daily teaching and administrative tasks. Although existing
evidence grounded in more reliable measures of time spent with technology, such as usage log analysis, shows
increases in efficiency for some tasks like lesson planning, qualitative studies highlight potential “blind spots” in
these estimates that warrant further research. Specifically, the hidden labour educators must invest in reviewing and
verifying the accuracy of AI-generated content may not be fully accounted for when relying solely on usage logs,
unless those revisions are made within the logged platform itself. All this emphasises the need for further research into how GenAI tools can
be designed to enhance, rather than complicate, teaching practices.
In the realm of assessment, GenAI offers valuable opportunities to streamline the creation of assessments and
automate scoring processes. Its potential has been demonstrated in large-scale standardised tests, such as the
Duolingo English Test, where GenAI assists in generating items that meet psychometric standards. However, this level of psychometric rigour is rarely required or feasible in everyday classroom assessments. While widely
available LLMs can help teachers design questions or tasks more efficiently, their outputs still require human review
to ensure pedagogical relevance, truthfulness, fairness, and alignment with learning goals. Likewise, prompt-based
approaches to automated scoring, although more accessible to non-technical users, remain less reliable than fine-tuned or conventional machine learning models. Educators should therefore treat GenAI tools as a complementary
aid rather than a substitute for human judgement, validating their outputs for clarity and appropriateness before
classroom use. Future research should focus on developing practical frameworks that help teachers integrate GenAI tools into formative and summative assessment processes responsibly, combining the efficiency of automation with
the interpretive expertise of educators. To assist educators, future work needs to focus on developing classroom
assessment strategies that incorporate GenAI in meaningful ways, expanding its applicability while also ensuring that
it enhances the overall learning and teaching experience for both educators and students.
Finally, our study highlights a critical risk: the uncritical adoption of GenAI may inadvertently undermine the
development of key human skills such as critical thinking, metacognition, and evaluative judgment, all of which are
foundational to genuine expertise. This could result in what we describe as the “mirage of false mastery,” where the
impressive outputs generated by AI mask the underdevelopment of essential skills, including hybrid human-AI skills (Box 2.3). The path forward, therefore, is not a rejection of technology but a commitment to pedagogical intentionality
and methodological rigour. Rather than simply asking whether GenAI “augments students’ task performance”, we
must focus on how it can be used to foster deep, meaningful, and durable learning. This means reorienting our
focus from GenAI-driven products to human-centred processes, ensuring that GenAI tools are designed to scaffold
rather than supplant human thinking. By prioritising the development of durable, transferable skills and integrating
metacognitive awareness into both learning and assessment, we can unlock the transformative potential of GenAI,
creating an educational future that is not only more efficient but also authentically human.