Generative AI for human skill development and assessment: Implications for existing practices and new horizons.
Generative artificial intelligence (GenAI) is transforming the landscape of education by reshaping how
skills are developed, assessed, and supported. This section synthesises recent empirical evidence on
how GenAI technologies influence instructional practices, feedback, and assessment. It examines
both the opportunities and limitations of using GenAI to provide personalised tutoring, enhance
feedback quality, and automate assessment practices. The chapter argues for a careful balance
between human skill development and AI-augmented performance, emphasising the need for
pedagogically grounded integration of GenAI within intelligent tutoring and assessment frameworks.
It concludes by outlining directions for research and policy that ensure GenAI strengthens, rather
than substitutes, human learning and instructional expertise.
The wide adoption of generative artificial intelligence (GenAI) – after the public release of ChatGPT in November 2022
– has triggered profound debates about its implications for education. GenAI offers technologies that can
support skill acquisition through personalised instruction and feedback, and enhance the efficiency and effectiveness
of teaching practices. However, GenAI also poses ethical challenges and risks. These developments have prompted educators, education leaders,
and policymakers to engage extensively with GenAI and to rethink pedagogical, assessment, and governance frameworks
to harness GenAI’s potential while mitigating its risks. Through these efforts, many education institutions have
produced policies and guidelines to support staff and students in using generative AI. Similarly,
many government, intergovernmental, nongovernmental, and not-for-profit organisations have also produced
documents that inform GenAI adoption, responsible practices, and frameworks for professional development of
educators.
Equally, the rapid developments in GenAI have mobilised many researchers to study its implications for education
and human learning.
This section aims to summarise recent evidence about the implications of GenAI for human skill development and assessment. The focus will be on human skill development and assessment as they are central to education
and professional development programs. The analysis of the implications of GenAI on human skill development
and assessment is particularly framed around two complementary perspectives. First, GenAI technologies offer
some promising prospects for advancing our existing practices related to skill development and assessment.
For example, GenAI can be used to provide interactive instructional support, offer personalised feedback at scale, and automate the creation and implementation of assessments. Second, GenAI challenges existing assumptions about learning practices and calls for novel
approaches to assessment. For example, while GenAI can increase performance in certain situations, it can also limit
human agency and result in overreliance on AI. Finally, we need to strengthen the research methods used to study human skill development and
assessment in the age of GenAI to avoid challenges recently noted in the literature, such as the conflation of learning
and performance.
This section is based on the analysis of empirical evidence published in the research literature. It offers a summary
of the existing evidence about the effectiveness of GenAI in supporting existing practices for instruction, feedback, and
assessment given their central roles in education and professional development. It also describes the recent
conceptualisation of hybrid human-AI skills that recognises the need to support development of human skills while
enhancing task performance with the use of GenAI. The section concludes by providing implications for practice and
policy grounded in existing evidence and promising directions for future research. Box 2.1 provides a glossary of the
main terms and types of generative AI and associated techniques in the field of AI in education (AIED).
Providing enhanced instructional support at scale is one of the most prominent areas for the use of GenAI in
education. This is grounded in the idea of making use of GenAI for developing systems that can
offer personalised learning support. The idea of personalised learning support is grounded in Bloom’s “two-sigma problem”, which showed the significant benefits of one-to-one instruction over other forms of instruction. Before the
widespread use of GenAI, the effectiveness of personalised learning support had long been studied in the literature
on artificial intelligence in education, particularly focusing on intelligent
tutoring systems and resulted in the development of many
effective tutoring systems – e.g. SQL-Tutor, MetaTutor, and Cognitive Tutors. Especially relevant to today’s attempts to provide personalised
learning support are intelligent tutoring systems such as AutoTutor and BEETLE that were already designed to provide tutoring through dialogue in natural language. This
research also informed the development of many commercial tutoring systems such as MATHia based on Cognitive
Tutor and ALEKS. However, the rapid development of such systems remains a challenge.
GenAI offers promising approaches that can be used for rapid development of instructional systems for personalised
support. Specifically, GenAI, through the use of large language models, can be leveraged to develop tutoring chatbots.
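To make the idea concrete, below is a minimal, hypothetical sketch of how a Socratic tutoring constraint might be expressed as a system prompt for a large language model. The function name, constraint list, and example problem are illustrative assumptions, not the design of any production tutor such as Khanmigo.

```python
def build_socratic_prompt(subject: str, problem: str) -> str:
    """Assemble a system prompt that constrains an LLM to Socratic tutoring.

    The constraint list is illustrative; production tutors use far more
    elaborate prompt and safety layers.
    """
    constraints = [
        "Never give the final answer directly.",
        "Ask one guiding question at a time.",
        "Adapt hints to the student's last response.",
        f"Stay within the scope of {subject}.",
    ]
    return (
        "You are a patient Socratic tutor.\n"
        + "\n".join(f"- {c}" for c in constraints)
        + f"\n\nProblem under discussion: {problem}"
    )

# The resulting string would be passed as the system message of a chat model.
prompt = build_socratic_prompt("algebra", "Solve 2x + 3 = 11")
```

In practice, the quality of the tutoring depends on how reliably the underlying model complies with such constraints, an issue discussed later in this section.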
A prominent example is Khan Academy’s Khanmigo chatbot, which makes use of large language models to conduct
scaffolded, Socratic‐style tutoring across diverse subject areas. As one of many emerging GenAI
chatbots in education, Khanmigo illustrates how these technologies can scale personalised learning support and
expand opportunities for learner autonomy and exploration. However, at the time of writing this chapter, no published studies had evaluated the effectiveness of Khanmigo on learning (although at least one preregistered RCT is ongoing in Canada).
Evidence on the effectiveness of GenAI to enhance instructional support is still emerging and offers mixed support.
For example, a randomised controlled trial conducted at Harvard University showed significant effects (0.73–1.3
standard deviations) of an AI tutor – a ChatGPT-powered system – relative to in-person active learning
classes in an undergraduate physics course. The World Bank has recently reported the findings
of a randomised controlled trial in nine secondary schools in Nigeria. In the trial, students
were randomised into a treatment group that received access to Microsoft Copilot (based on GPT-4) in an after-school
programme and a control group that did not. Students in the treatment group received teacher
instruction on how to use Copilot, including prompts, and worked in pairs with other students. The results showed
positive effects of the intervention, with an effect size of 0.31 standard deviations. However, this effect size was
lower than the average effect size noted in the meta-analyses of the effectiveness for intelligent tutoring systems –
i.e. 0.42–0.57 standard deviations according to Ma et al. and 0.66 according to Kulik and Fletcher.
This suggests that past AI tutors might be more effective, albeit in different contexts. Nevertheless, the World
Bank study findings align with the range observed in promising computer-assisted learning interventions reviewed
by Escueta et al., who identified effect sizes between 0.18 and 0.63 standard deviations for personalised
and adaptive programs, particularly in mathematics. The World Bank study also showed that the students with high
prior academic performance and high socio-economic status particularly
benefited from the interventions. While these findings come from a context where socio-economic disparities are
likely more pronounced than in most OECD countries, they still suggest that GenAI-based tutoring systems may
disproportionately benefit certain groups of students. Future research in diverse educational settings is needed to
corroborate this pattern.
The way a GenAI-based tutoring system is configured and used may have profound implications for learning.
In a large-scale field experiment in high school math classrooms, Bastani et al. found that while
GPT-4-based tutors improved performance during use (up to 127%), students who used a standard chatbot akin to
ChatGPT performed worse (17% lower performance) than the control group once access to the chatbot was removed.
The control group students did not use any GenAI-based instructional support in addition to the conventional
classroom instruction. This negative effect of GenAI use was mitigated by a version designed with learning safeguards,
suggesting that poorly configured systems may undermine long-term learning.
Similarly, Lehmann et al. showed that a ChatGPT-based tutor for Python programming had no
overall effect on learning, but its impact depended on usage patterns. Students who heavily relied on the
ChatGPT-based tutor tended to cover a broader range of topics but developed shallower understanding, while
those who used it to complement learning gained deeper understanding. The use of the ChatGPT-based tutor also
widened performance gaps between students with high and low prior knowledge. In summary, the Bastani et al. and Lehmann et al. studies highlight that the instructional strategies embedded in
the design of GenAI-based instructional systems and the ways students use them are two key factors that need
to be considered in research and practice. Future research should investigate the effectiveness of the different
pedagogical approaches that GenAI-based instructional systems use and the factors (e.g. metacognitive skills) that
explain students' different usage patterns. It is also important to examine how conventional and generative AI models can be effectively integrated.
Such integration should enhance human learning by combining the strengths of each approach while minimising risks,
including hallucinations that may compromise reliability.
Effective instructional support from GenAI-powered systems requires alignment with the proven experience
of developing intelligent tutoring systems. Although intelligent tutoring systems are frequently referenced in recent developments,
GenAI-powered instructional systems do not follow their typical architecture. At the core of intelligent tutoring systems are learner models, tutor models, domain models, and a user interface. Most GenAI-based instructional systems primarily make use of large language
models to cover functions of all these four components. While user interfaces through natural language interaction
can be quite advanced with LLMs, the support for the other three components is less obvious. Although the
functions of the domain and tutor models can be performed by LLMs to some extent, there is presently limited
research and evidence on how their quality can be assured. Specifically, due to their stochastic nature, LLMs
cannot guarantee the reliability of information covered in the domain model because of their tendency to hallucinate
(Ji et al.). Existing research shows that LLMs can easily be distracted and comply inconsistently with the
instructions provided in the underlying prompts.
Therefore, future research is needed to assess the extent to which LLMs can consistently comply with a particular
tutoring strategy to offer long-term effects. Moreover, future work is needed to develop effective computational
approaches that can increase compliance of LLMs to promote effective tutoring strategies over time. Finally,
there is very little evidence in the literature that existing GenAI-based instructional systems offer any learner
models. Learner models are precisely what is needed to understand individual student needs – tracing students' knowledge
development and the learning approaches they take – in order to provide adaptive
and personalised support. Recent evidence by Borchers and Shou shows that LLM-only tools offer only
limited adaptivity compared to conventional intelligent tutoring systems. Future research is needed to address
these critical architectural needs and identify effective ways of integrating LLMs within tutoring architectures
to enhance instructional effectiveness.
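One possible direction for such integration can be sketched as follows: keep the domain and learner models as verifiable structures outside the LLM, and use the LLM only to phrase the tutoring turn. This is a minimal illustrative sketch under assumed component names and a toy selection policy, not an established architecture.

```python
# Curated, verified content plays the role of the domain model
# (deliberately NOT generated by the LLM, to avoid hallucinated facts).
DOMAIN_MODEL = {
    "loops": "A for-loop repeats a block once per item in a sequence.",
    "functions": "A function packages reusable logic behind a name.",
}

def tutor_model(learner_model: dict) -> str:
    """Toy tutoring policy: pick the least-mastered skill to practise next."""
    return min(learner_model, key=learner_model.get)

def build_turn(learner_model: dict) -> str:
    """Assemble an LLM prompt grounded in the domain and learner models."""
    skill = tutor_model(learner_model)
    fact = DOMAIN_MODEL[skill]  # grounding text the LLM must not contradict
    return (
        "Rephrase the following verified explanation as a friendly hint, "
        "without adding new facts:\n" + fact
    )

# Traced mastery estimates would come from a knowledge-tracing component.
learner = {"loops": 0.8, "functions": 0.3}
prompt = build_turn(learner)
```

The design choice illustrated here is that the LLM never decides *what* to teach or *whether* a fact is true; those decisions stay with inspectable components, which is one way the reliability concerns raised above might be mitigated.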
Enhancing instructional support does not necessarily need to be done through providing direct instruction to students.
Teachers can also be beneficiaries of GenAI for tasks related to preparation for teaching and during the actual act
of teaching. For example, GPTeach is an interactive teacher training tool that enables novice educators to practice
teaching with GPT-simulated students. Evaluations of GPTeach have shown that it can enhance teachers’ preparedness
and confidence, offering valuable practice opportunities tailored to varied teaching scenarios.
Relatedly, Tutor CoPilot is a GenAI-powered system that provides real-time, expert-like guidance to tutors during live
tutoring sessions. This approach is particularly relevant for supporting students by mobilising
a less experienced workforce and addressing the issue of teacher shortages. In a randomised controlled trial, Wang
et al. evaluated the effectiveness of Tutor CoPilot. The study involved 900 tutors and 1 800 K-12 students
from historically under-served communities. Results indicated that students whose tutors had access to Tutor CoPilot
were 4 percentage points more likely to master topics, with the most significant benefits observed among students
of lower-rated tutors, who experienced a 9 percentage point improvement. Additionally, tutors using Tutor CoPilot
were more inclined to employ high-quality pedagogical strategies, such as asking guiding questions, and less likely
to provide direct answers. Although the studies with GPTeach and Tutor CoPilot show much promise, future research
is needed to understand the uptake and effectiveness of such systems in diverse educational and international
contexts. Equally important is future research to understand how effectively tutoring practices supported by Tutor
CoPilot are internalised by teachers over time as part of their professional development, and whether they may lead
to overreliance on GenAI, potentially hindering the development of teachers’ human teaching skills.
GenAI can support teachers in a range of tasks, with mixed results regarding their effectiveness and efficiency.
Although lesson planning is frequently discussed as one of the key areas of teaching practice that can benefit
from the use of GenAI, evidence about its effectiveness is still
emerging. For example, Dennison et al. evaluated Shiksha Copilot, an AI-assisted lesson planning
tool deployed in schools in India. In a large-scale mixed-methods study, including interviews, surveys, and
usage logs, the study found that teachers used Shiksha Copilot to meet administrative documentation needs
and support their teaching. The use of the tool was associated with a reduction in lesson planning time, with
small to large effect sizes (Cohen’s d = 0.371–0.658), and lowered teaching-related stress (Cohen’s d = 0.436),
while promoting a shift toward activity-based pedagogy. However, systemic challenges, such as staffing shortages
and administrative demands, constrained broader pedagogical change. In contrast, Selwyn et al. conducted
interviews with teachers about their experiences with GenAI tools for administrative tasks in Sweden and Australia,
highlighting the significant work teachers self-report investing in reviewing, repairing, and reworking AI-generated
outputs. Their findings suggest that the promise of time-saving in AI tools may overlook the complex professional
judgments teachers must make regarding pedagogical appropriateness, social relations, and educational value.
However, Selwyn et al.’s findings are based on self-reports (i.e. interviews), which likely do not reliably
estimate time spent on technology use. Usage log analysis, on the other hand, offers a more
accurate and less biased approach. In contrast to the Dennison et al. study, which is grounded in usage
log analysis to provide more reliable usage time estimates, the Selwyn et al. study highlights the need to
account for the hidden labour of teachers that may not be captured by usage logs. Yet, given that Dennison et al. compared GenAI-supported lesson planning with a non-GenAI baseline, some of this hidden labour may
already have been reflected in their analysis. This highlights the importance of fair and contextually comparable
evaluation frameworks that consider how GenAI tools are implemented and how teacher time use is measured
across studies. Given the essential role teachers play in education, this underscores the importance of exploring
design principles, organisational adoption strategies, and the broader implications of adopting GenAI technologies
for teaching support.
One of the most pressing areas of application for GenAI in education is the provision of automated feedback.
Feedback represents a persistent challenge in higher education, where increasing student numbers are not matched
by proportional increases in teaching resources. It is also a challenge at school level in contexts where
student/teacher ratios are high, or when teachers teach a subject with few curriculum hours (and thus many classes
and students). This structural tension has made it difficult to offer timely, targeted, and individualised feedback
at scale that follows principles of effective and learner-centred feedback. Feedback can improve learning progression and support the development of relationships between students and educators. As shown in the remainder of this section, GenAI holds strong promise
for enabling the rapid and scalable generation of feedback across multiple modalities, with the potential to enhance
feedback scalability, quality, and even feedback literacy.
GenAI has been found to be promising to offer feedback on students’ written products in higher education. In
a recent study, Dai et al. compared feedback generated by large language models to that provided by
human tutors. Their study compared feedback on readability, the similarity of positive and negative points identified,
and the levels at which feedback was provided. The levels of feedback were grounded in Hattie and Timperley’s seminal framework that distinguishes feedback on task (correctness), process (learning strategies), self-regulation
(monitoring learning), and self (personal traits and motivation). In this framework, higher-level feedback, particularly
at the process and self-regulation levels, is widely recognised as more educationally valuable and a key indicator of
feedback quality because it supports deeper learning and learner autonomy. The Dai et al. findings revealed that
GenAI (i.e. GPT-3.5 and GPT-4) tended to produce more readable and stylistically polished feedback than human
educators, with quite a large effect size (d = 1.79). This finding was somewhat unsurprising,
given that human assessors often operate under strict time constraints and offer rather succinct feedback. However,
the study also revealed limited alignment between the strengths and weaknesses that GenAI-produced feedback and
human tutors identified in student work according to a rubric. Dai et al. also showed that GPT (particularly
GPT-4) models were able to produce feedback that offered guidance about future choice of learning strategies (i.e.
process level feedback) in over 97% of feedback instances. Interestingly, this was higher than what was observed in
human-provided feedback which was on process level in about 80% of feedback instances. However, GPT-4 was much
less able to produce feedback at the self-regulation level, which appeared in only 17% of cases. However, even that was
higher than human tutors, who offered self-regulation level feedback in only 11% of cases. This
highlights the challenge of providing feedback at the self-regulation level, where learners are guided to monitor their
own learning. This challenge is particularly important in the age of GenAI as discussed later.
GenAI can also be used to generate feedback guiding students based on the insights of predictive modelling.
Early prediction of students at risk of failing or dropping out has been at the core of research and practice in
learning analytics for a long time. However,
translating insights from predictive modelling to actionable feedback has received much less attention. A notable
example with much success in improving student learning and experience was the OnTask system that allowed
educators to manually write rules to generate personalised feedback based on student data.
Although much more efficient than manual feedback writing at scale, it still could not translate granular insights
of predictive modelling to actionable feedback (e.g. advice on which practice exercises to take). To address this
challenge, Liang et al. proposed an approach for transforming insights from predictive modelling to
personalised feedback with the use of GPT-4, which was rated by experienced educators as “readily applicable to
the course” and higher on readability, relational characteristics, and specificity than
human-provided feedback. However, future research is needed to assess the effectiveness of such
personalised feedback on learning outcomes, student retention, and the extent to which students actually use and
act upon the AI-generated feedback.
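The general idea of turning predictive-model outputs into a feedback-generation prompt can be sketched as follows. This is a hypothetical illustration of the approach described above, not the Liang et al. implementation: the field names, the risk value, and the listed drivers are all invented for the example.

```python
def insights_to_prompt(risk: float, drivers: dict, course: str) -> str:
    """Render predictive-model insights as a prompt for an LLM feedback draft.

    `drivers` maps a predictive feature to a human-readable observation;
    all names here are illustrative assumptions.
    """
    lines = [f"- {feature}: {value}" for feature, value in drivers.items()]
    return (
        f"You are writing feedback for a student in {course}.\n"
        f"Predicted risk of failing: {risk:.0%}.\n"
        "Top factors behind the prediction:\n"
        + "\n".join(lines)
        + "\nWrite short, encouraging, actionable advice (e.g. which "
        "practice exercises to attempt) addressing each factor."
    )

prompt = insights_to_prompt(
    0.72,
    {"quiz_completion": "3 of 10 quizzes attempted",
     "forum_activity": "no posts in 4 weeks"},
    "Introductory Statistics",
)
```

The point of such a pipeline is that the numeric model supplies *what* to address while the LLM supplies *how* to say it, which is why educator review of the drafted feedback remains important.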
GenAI can also be used to check the quality of feedback to promote best practices at scale. Previous research
demonstrates the potential of the use of conventional machine learning to recognise whether human produced
feedback followed established models for feedback. For example, Osakwe et al. used
an XGBoost machine learning model trained on established linguistic features (e.g. cohesion or use of cognitive
words) to identify self, task, and process levels of feedback with accuracy values of 0.87, 0.82, and 0.69, respectively.
In a recent study, Aldino et al. evaluated the performance of GPT-3.5 with zero-shot prompts to identify
elements of learner-centred feedback on a large dataset of feedback messages (>16k) in higher education.
GPT-3.5 showed some promising results, with accuracy in the range of 0.53–0.97 across the seven attributes of learner-centred feedback. However, GPT-3.5 was consistently outperformed by conventional machine learning models (i.e.
XGBoost and Random Forest) based on linguistic features (e.g. cohesion and word count), while BERT almost always
performed reliably (accuracy 0.91-0.99) (see Box 2.1 for definition of BERT). Higher accuracy of traditional machine
learning over ChatGPT was also shown in evaluation of the quality of peer feedback. Similarly,
Dai et al. showed that GPT-4o was able to identify nine out of 10 relational characteristics of feedback with
an average accuracy exceeding 80%. For example, the model successfully recognised feedback that acknowledged
students’ strengths, offered balanced critical comments, and included actionable suggestions for improvement. Yet,
they found no significant improvement from few-shot prompting strategies over zero-shot prompting. These findings
suggest that while GPT prompting approaches offer a promising and accessible entry point due to their lower technical
barrier, achieving consistently high accuracy still requires conventional machine learning methods and language models
like BERT.
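To make the feature-based classification pipeline concrete, here is a deliberately simplified sketch: cue-word features drive a rule-based label assignment for Hattie and Timperley's feedback levels. The cue lists and rules are illustrative assumptions and a toy stand-in for the richer linguistic features and XGBoost models used in the studies above.

```python
# Toy cue-word lists standing in for established linguistic features.
PROCESS_CUES = {"strategy", "approach", "method", "revise", "structure"}
SELF_REG_CUES = {"monitor", "reflect", "plan", "goal", "self-assess"}

def extract_features(feedback: str) -> dict:
    """Turn a feedback message into a small numeric feature vector."""
    words = [w.strip(".,!?") for w in feedback.lower().split()]
    return {
        "n_words": len(words),
        "process_hits": sum(w in PROCESS_CUES for w in words),
        "selfreg_hits": sum(w in SELF_REG_CUES for w in words),
    }

def classify_level(feedback: str) -> str:
    """Rule-based stand-in for a trained classifier over these features."""
    f = extract_features(feedback)
    if f["selfreg_hits"] > 0:
        return "self-regulation"
    if f["process_hits"] > 0:
        return "process"
    return "task"
```

In the published work, the same features-to-label mapping is learned from annotated data rather than hand-written, which is what allows accuracies in the ranges reported above.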
The differences in performance between GenAI and human educators create new opportunities to complement
and enhance the effectiveness of human tutors. For example, GenAI can provide positive and negative points
in feedback, which human educators can use as suggestions to enhance their own feedback drafts. This is also
suggested by Lu et al. who argue that GenAI can offer immediate and personalised feedback on lower-order concerns in written products such as grammar, vocabulary, and sentence structure. Their premise is that
this may allow teachers to focus on higher-order thinking skills, content depth, and argumentation, where human
judgment remains crucial. The results of the Dai et al. study indicate that GenAI can help enhance human
feedback with effective feedback practices. This hybrid approach holds potential to
enhance efficiency without compromising pedagogical judgment and future research and practice should evaluate
its effectiveness. Despite all these promises, specialised tools that promote this hybrid approach for educators
are in their early days. For example, Feedback Copilot was developed using principles of co-design to create
effective user interfaces that incorporate the use of GenAI. The efficacy of Feedback Copilot
is yet to be evaluated in practice, which highlights an important research gap and direction for future research.
A growing body of research has explored how students perceive and respond to feedback generated by GenAI, particularly in comparison to human feedback. Studies have shown that students tend to act more readily on feedback from human instructors than from GenAI tools. Students often found GenAI feedback specific and clear, however, especially in technical tasks. Nevertheless, several studies also highlight concerns regarding the perceived usefulness and trustworthiness of GenAI-generated feedback. Although these studies differ in focus and methodological design, ranging from quasi-experimental evaluations of learning outcomes and randomised controlled comparisons of instructor and AI feedback to large-scale perception studies in higher education contexts, they consistently point to lower perceived usefulness and trust in AI-generated feedback relative to human feedback.
Overall, current evidence suggests that while GenAI feedback can match human feedback in measurable learning outcomes, it does not replicate its pedagogical value or social credibility. For example, Escalante et al. found no significant difference in learning outcomes between students receiving feedback from GPT-4 and those receiving tutor feedback, although participants were evenly split in their perceptions of usefulness. While this might appear to suggest functional equivalence, comparable performance does not imply pedagogical interchangeability. As shown in the recent meta-analysis by Kaliisa et al. across 41 studies, AI-generated and human feedback yield statistically similar learning gains, yet students perceived human feedback as more credible and meaningful. This distinction points to the broader role of feedback in shaping motivation, evaluative judgment, and learner trust, dimensions that remain difficult for GenAI systems to reproduce even when outcome measures are equivalent. Similarly, Er et al.
reported that human feedback was perceived as significantly more useful, and students who received it showed greater improvement in lab scores in Java programming. In a related study, Nazaretsky et al. found that students’ perceptions of feedback varied depending on the provider’s identity. When the feedback source was unknown, students rated AI feedback more favourably; however, when the source was revealed, they placed greater trust in human feedback. Although highly relevant to trust, the effects of hallucinations in GenAI on feedback uptake have received little attention in the literature and warrant future attention. Perceptions of fairness have also been somewhat contradictory: while some studies found that GenAI feedback was rated as fair by students, other studies observed the opposite.
GenAI feedback has also shown potential to support important metacognitive processes. For instance, Tang et al. demonstrated that structured GenAI feedback on writing tasks significantly improved students’ self-assessment accuracy, which is a key skill for independent learning. However, this potential is not always realised. Jin et al. found that students with low feedback literacy engaged only minimally with a GenAI-based support tool, often due to a mismatch between the tool’s responses and their expectations. These findings suggest that the impact of GenAI feedback depends not only on its technical qualities but also on learners’ readiness to interpret and apply it effectively. As Zhan and Yan argue, fostering feedback engagement in a GenAI context requires the explicit development of students’ feedback literacy, including skills in prompt engineering, evaluative judgment, and metacognition, to facilitate deeper and more meaningful interaction with GenAI in feedback practices.
Future research should aim to (a) investigate the extent to which feedback literacy of students can be promoted to more effectively and critically engage with AI-generated feedback and (b) understand whether feedback literacy enables learners to improve their learning outcomes when using AI-generated feedback.
GenAI can support generation of feedback in different modalities that goes beyond textual feedback. For example,
learning analytics offers dashboards as an alternative and cost-effective approach to provide feedback based on
analysis of student data. However, learning analytic dashboards have not achieved their full
potential. One of the main reasons for this is the relatively limited data visualisation literacy of
educators and students, which constrains their ability to understand and translate insights from statistics and charts into action. To address these limitations in visualisation literacy, GenAI can offer
two complementary approaches.
First, GenAI can provide a layer guiding educators and learners to improve their abilities to comprehend dashboards
accurately. For example, Yan et al. developed a tool called VizChat, which allows students and educators
to interact with a chatbot to help them understand the data shown in the dashboard by asking questions (Figure
2.1). When configured in a proactive mode (i.e. using scaffolding questions), VizChat significantly enhanced the
comprehension of learning analytic dashboards compared to both a passive chat mode (i.e. responding to student
queries a la ChatGPT) and standalone scaffolding. Importantly, these benefits persisted
even when the students did not have access to proactive VizChat. Building on these promising results, future research
should investigate the extent to which learners and educators can transform insights obtained from learning analytic
dashboards into effective learning and teaching practice thanks to GenAI.
Second, GenAI can be used to generate feedback in other forms than text, for example in the form of data comics.
Data comics follow established principles of comic strip genres (e.g. Manga) and are generated by prompting multimodal language models to generate images based on analytic insights. Data comics
were applied in simulation-based learning for healthcare professionals, where student nurses engage in highly
collaborative learning scenarios in physical spaces. Data comics (see
Figure 2.2) aim to present feedback in a more accessible, emotionally engaging format. Qualitative evidence suggests
that data comics can improve student motivation and reflective engagement; some students even reported feeling
seen or valued. However, some students in higher education found this approach not
sufficiently professional and potentially perpetuating biases (e.g. all nurses generated in data comics were women
and the doctor was a man). Building on the positive aspects of GenAI-powered data comics, future research is needed to
understand their effectiveness across different educational contexts and levels, while minimising potential negative
effects. The same idea could also be applied to the AI generation of video clips based on multimodal learning
analytics.
Feedback on learning processes (e.g. goal setting, strategy use, and self-monitoring) is underrepresented in existing
literature on AI in education. While learning is never fully transparent, learning
analytics has made substantial progress in visualising and interpreting the otherwise invisible dynamics of the learning
process, which can provide educators and learners with actionable insights into how learning unfolds. With advances
in learning analytics, we can now analyse fine-grained trace data such as clickstreams, mouse movements, and other
digital traces of student activity to identify cognitive, metacognitive, affective, and motivational processes. Research in learning analytics has also shown that such approaches can offer
nuanced insights into the learning strategies that learners used. Moreover, existing research has
shown that learning processes can explain more variance in student essay scores than linguistic essay properties
(e.g. text cohesion) that are commonly used in automated essay scoring. However, translating
insights from the underlying representations of data analytic models – e.g. process maps, networks, or descriptive
statistics – requires considerable data literacy, which can be a barrier for many educators and learners.
GenAI holds strong potential to support feedback practices on learning processes due to its ability to
combine insights from data analytic models about learning processes with instructional information and
subject matter content. By combining all these perspectives, GenAI can produce contextually relevant and
personalised learning support – e.g. feedback or scaffolds – that aim to guide learners to improve their learning
processes and performance. As outlined in
Box 2.2, LLMs can be prompted with insights from real-time analytics of self-regulated learning processes, along with
information about principles for effective feedback and relevant content information,
to generate personalised scaffolds.
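The prompting approach just described can be sketched as a simple template that merges analytics insights, feedback principles, and subject-matter context into a single instruction for an LLM. This is a minimal illustration; the function and field names are hypothetical and not those of the system in Box 2.2.

```python
def build_scaffold_prompt(srl_insights, feedback_principles, content_summary):
    """Assemble an LLM prompt that combines real-time analytics of
    self-regulated learning (SRL) with feedback principles and
    subject-matter context to request a personalised scaffold."""
    insight_lines = "\n".join(f"- {i}" for i in srl_insights)
    principle_lines = "\n".join(f"- {p}" for p in feedback_principles)
    return (
        "You are a tutoring assistant. Based on the learner's observed "
        "self-regulated learning processes, generate one short, "
        "personalised scaffold (a question or hint, not a solution).\n\n"
        f"Observed SRL processes:\n{insight_lines}\n\n"
        f"Feedback principles to follow:\n{principle_lines}\n\n"
        f"Current task content:\n{content_summary}\n"
    )

prompt = build_scaffold_prompt(
    srl_insights=["spent 80% of session time reading, little planning",
                  "no self-monitoring events in the last 10 minutes"],
    feedback_principles=["be specific and actionable",
                         "prompt reflection rather than give answers"],
    content_summary="Essay task on the causes of climate change.",
)
```

The key design point is that the analytics insight, the pedagogical principle, and the content each occupy a distinct slot, so the same template can serve different learners and tasks.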
The potential of process feedback has profound implications in the age of GenAI. As students can now easily use
GenAI tools to produce polished final products, it becomes increasingly important to assess how students engage
with the learning process, rather than focusing solely on the end result. Moreover, process feedback can highlight
critical challenges learners may face when using GenAI (e.g. metacognitive laziness and overreliance).
The transformative potential of process assessment is further discussed below.
GenAI holds the promise to generate assessment items. Although GenAI can produce a wide range of content, its
use in standardised assessment requires generating items that meet psychometric standards of validity and reliability. Emerging evidence suggests this is feasible. For example,
Bhandari et al. showed that ChatGPT can generate psychometrically sound items for algebra, while Attali
et al. demonstrated similar success for reading tasks. The work by Attali et al. underpins the
automated item generation process used in the Duolingo English Test, a widely recognised language proficiency
exam. GenAI can also be used to evaluate the quality of assessment items. Work at Duolingo emphasised the importance
of the human-in-the-loop for item quality review and sensitivity review as part of quality assurance,
before checking the psychometric properties of the generated items. This is also aligned with
the recommendations by Moore et al. to combine human judgement with LLMs to produce high-quality
multiple choice questions and short answer questions.
There is growing evidence of the potential of GenAI in existing assessment practices. Existing research
shows that the use of GenAI can be particularly effective when fine-tuned LLMs are used for automatic scoring of
open-ended responses, demonstrating accuracy comparable or superior to models based on conventional machine
and deep learning approaches. Latif and Zhai, for instance, showed that a fine-tuned version of GPT-3.5
significantly outperformed BERT in scoring multi-label and multi-class science education tasks, achieving up to a
10.6% accuracy improvement. Similarly, GPT-4 has shown strong alignment (Quadratic Weighted Kappa (QWK) over
0.8) with contemporary writing evaluation tools in high-stakes language assessment contexts for L2 English learners,
especially when provided with a single calibration example for each rating category. However,
Mansour et al. showed that conventional approaches substantially outperformed ChatGPT-3.5 Turbo and
Llama2 (average QWK of 0.817 vs 0.313 and 0.201) on automatic essay scoring of English essays from the Automated
Student Assessment Prize dataset, which contains essays written in English by U.S. secondary school students in
grades 7–10 for whom English is the first language on persuasive, source-dependent, and narrative writing tasks.
The results indicate that although LLMs can potentially be useful for some types of automated scoring tasks, they
may not be for others. It is therefore important to extend the existing body of knowledge to understand the types of
tasks LLMs can be effective for to inform educational practice and policy. Likewise, educators need to be careful in
their choices of relying on GenAI for automatic scoring.
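Quadratic Weighted Kappa (QWK), the agreement statistic reported in the studies above, can be computed directly from two lists of integer ratings. The sketch below implements the standard formula and is not the evaluation code of any cited study.

```python
from collections import Counter

def quadratic_weighted_kappa(rater_a, rater_b, min_rating=None, max_rating=None):
    """Quadratic weighted kappa between two lists of integer ratings
    (e.g. human scores vs automated scores on the same essays)."""
    assert len(rater_a) == len(rater_b)
    if min_rating is None:
        min_rating = min(min(rater_a), min(rater_b))
    if max_rating is None:
        max_rating = max(max(rater_a), max(rater_b))
    num_ratings = max_rating - min_rating + 1
    n = len(rater_a)

    # Observed agreement matrix: counts of (score_a, score_b) pairs.
    observed = [[0.0] * num_ratings for _ in range(num_ratings)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1

    # Expected counts under independence, from each rater's marginals.
    hist_a = Counter(a - min_rating for a in rater_a)
    hist_b = Counter(b - min_rating for b in rater_b)

    numerator = 0.0
    denominator = 0.0
    for i in range(num_ratings):
        for j in range(num_ratings):
            # Disagreement penalty grows quadratically with distance.
            weight = ((i - j) ** 2) / ((num_ratings - 1) ** 2)
            expected = hist_a[i] * hist_b[j] / n
            numerator += weight * observed[i][j]
            denominator += weight * expected
    return 1.0 - numerator / denominator
```

QWK is 1 for perfect agreement, 0 for chance-level agreement, and negative when raters systematically disagree, which is why values above 0.8 are treated as strong alignment in the studies cited here.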
Several studies have examined the extent to which GenAI can automatically assess responses to open-ended questions
in standardised assessments and identify effective prompting strategies. For example, Rodrigues et al. evaluated GPT-4 across 738 open-ended questions drawn from high school Biology, Earth Science, and Physics tasks
categorized by Bloom’s taxonomy. The model produced high-quality responses overall, though its
performance declined on questions requiring factual recall or creative reasoning. Chan et al. analysed LLMs
in standardized STEM assessments and showed that chain-of-thought prompting significantly improved accuracy,
particularly for reasoning-intensive problems. In higher education, Moore et al. explored GPT-3’s ability
to evaluate student-generated short-answer chemistry questions in online college courses and found only modest
alignment (32-40%) with expert judgments. Together, these studies show that while GenAI can complement human
grading in structured educational contexts, its reliability still varies by domain, cognitive demand, and prompt design,
highlighting the continued need for human oversight in both item generation and scoring.
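Chain-of-thought prompting of the kind examined by Chan et al. can be illustrated with a grading template that asks the model to reason before scoring. The wording and rubric below are hypothetical, intended only to contrast step-by-step reasoning with direct score elicitation.

```python
def build_cot_grading_prompt(question, student_answer, rubric):
    """Ask the model to reason step by step before committing to a score,
    rather than emitting a score directly."""
    rubric_lines = "\n".join(f"{score}: {desc}" for score, desc in rubric.items())
    return (
        f"Question:\n{question}\n\n"
        f"Student answer:\n{student_answer}\n\n"
        f"Rubric:\n{rubric_lines}\n\n"
        "First, reason step by step: identify the key ideas the rubric "
        "rewards, check which of them the answer contains, and note any "
        "errors. Only then output a line of the form 'Score: <n>'."
    )

prompt = build_cot_grading_prompt(
    question="Why does ice float on water?",
    student_answer="Because ice is less dense than liquid water.",
    rubric={2: "full explanation", 1: "partially correct", 0: "incorrect"},
)
```

Constraining the final line to a fixed format ('Score: <n>') also makes the model's output machine-parseable, which matters when scoring at scale.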
Despite the promise of generative AI, its integration into education raises fundamental questions that demand critical scrutiny. As the capacity of AI systems to automate cognitive tasks increases, it becomes imperative to interrogate not only what these technologies can accomplish but also what might be lost in the process. This section examines how prevailing assumptions about skill development and assessment are being disrupted and suggests a reorientation of educational priorities for the era of generative AI.
A central contention of this chapter is that educational systems must intentionally foster human capabilities even as they
leverage GenAI’s transformative potential. This imperative is not merely pedagogical; it is foundational to the cultivation
of human skills that will enable individuals to thrive in rapidly evolving digital environments. According to Yan, Greiff, et
al., it is important to distinguish between two interrelated dimensions in human learning when using GenAI:
AI-empowered performance and human skill development (Figure 2.4). The first dimension (vertical in Figure 2.4)
focuses on the development of human skills, i.e. human learning. This dimension has traditionally been the focus of
education, including research on AI in education; applications of GenAI to support human learning
were covered in the previous section on "Existing Practices". However, the ubiquitous presence of GenAI changes the
context in which learning happens. This is why we also consider the second dimension (horizontal in Figure 2.4), which
concerns the extent to which individuals use AI tools, such as large language models, to enhance task execution and
produce high-quality outputs. In the remainder of this section, we consider the implications of these two dimensions
on human skill development according to evidence emerging from the existing literature.
The intersection of these dimensions defines the horizon toward which education should strive: learners who combine
strong independent skills with the effective, reflective use of AI augmentation. However, a growing body of evidence
suggests that this aspiration is not easily achieved and that the introduction of generative AI can create a "mirage
of false mastery," where high-quality, AI-enabled output conceals underlying weaknesses in human skill – i.e. the
undesirable curve in Figure 2.4 where task performance does not correlate with learning.
While generative AI has shown promise in supporting various educational tasks, its effectiveness in fostering long-term skill development remains uncertain. An important study in this space was conducted by Darvishi and colleagues, who investigated the extent to which an AI support tool could improve students’ ability to provide effective
peer feedback. The GenAI tool was designed to support students in generating feedback more effectively, rather than
directly improving the content of their responses. In a large-scale randomised controlled trial with approximately
1 600 students, Darvishi et al. observed that while initial AI-supported gains in peer feedback quality were
significant, these gains were not sustained once the tool was withdrawn. Students did not retain the feedback skills that
appeared to have been acquired with AI support. Moreover, there was no robust evidence of synergistic development
of human and AI-empowered skills; students generally exhibited strength in either AI-assisted performance or
independent skill, but rarely both. These findings are echoed by a systematic review and meta-analysis by Vaccaro et
al., which analysed 106 experimental studies of human-AI collaboration. The meta-analysis found that, on average, human-AI combinations performed worse than the best of either humans alone or AI alone, especially in
decision-making tasks. This cautions against the assumption that human-AI synergy will naturally emerge. The risk in
many educational contexts is that generative AI either simply augments current abilities or, more problematically in
an educational context, substitutes for human effort without fostering genuine skill development.
This pattern of substitution is often driven by external pressures. Research by Abbas et al. revealed that
university students were more likely to use ChatGPT when facing a high academic workload and time pressure.
Their study, involving nearly 500 students, found that this utility came at a cost: increased use of ChatGPT was
correlated with higher levels of procrastination, self-reported memory loss, and ultimately, diminished academic
performance. Such findings suggest that students may turn to GenAI not as a partner for learning but as a tool to
manage overwhelming demands, leading to unintended negative consequences. Furthermore, this substitution can
foster an uncritical over-reliance on AI. In a systematic review, Zhai et al. investigated how over-reliance
on AI dialogue systems affects students’ cognitive abilities. They define over-reliance as the uncritical acceptance of
AI-generated recommendations, a tendency that arises when individuals struggle to assess the trustworthiness of
the tool. Their findings indicate that this behaviour encourages the use of cognitive shortcuts, favouring fast, efficient
answers over slow, effortful reasoning. This preference undermines the development of essential cognitive abilities,
including decision-making, analytical reasoning, and critical thinking. It is crucial, therefore, to resist the temptation
to conflate AI-augmented performance with authentic competence or deep learning.
Another dimension of concern relates to the effects of the use of GenAI on learning processes. There
is accumulating evidence that increased reliance on GenAI tools can suppress students’ engagement in
self-monitoring (defined as the ongoing process of checking, regulating, and adjusting one’s understanding and
strategies during learning), reflection, and evaluative judgement of one’s learning processes, processes that are
fundamental to autonomous learning. When GenAI is used as a shortcut rather than as a scaffold
that promotes learning, students may defer cognitive effort to technology, thereby weakening the very skills that
underlie deep learning.
Empirical research has begun to quantify the risks GenAI poses to human cognition and metacognition. In a study
comparing the use of ChatGPT to traditional search engines for a scientific inquiry task, Stadler et al. found that students using the large language model experienced a significantly lower cognitive load. However,
this cognitive ease came at a cost: these students produced lower-quality reasoning and argumentation in their
final recommendations compared to the group using the Google search engine. This highlights a critical trade-off, suggesting that while LLMs can reduce the cognitive burden of information gathering, they may not promote
the deeper cognitive engagement necessary for high-quality learning. This finding is reinforced by a randomised
experimental study by Fan et al., which compared university students’ writing processes when supported
by ChatGPT, a human expert, a writing analytics tool, or no additional support. While the ChatGPT-supported group
showed greater improvements in essay scores, these gains did not translate into deeper knowledge acquisition
or transfer (as measured by a knowledge transfer test on different topics). More importantly, the study found that
learners in the AI-supported group demonstrated a marked reliance on the technology and were less likely to engage
in metacognitive activities such as self-monitoring and reflection, a phenomenon the authors term metacognitive
laziness.
The impact of AI on self-directed learning is further complicated by students’ motivations for using these tools. A year-long longitudinal study by Xie et al. examined how interaction frequency with chatbots affected learning
autonomy. The results were nuanced: for learners seeking virtual companionship, the social presence fostered by the
AI had a positive mediating effect on their learning autonomy. Conversely, for learners focused purely on knowledge
acquisition, more frequent interaction with the chatbot was negatively correlated with both social presence and
learning autonomy. This indicates that the effect of AI interaction is not uniform and that frequent use for instrumental
purposes may undermine the development of independent learning habits.
These findings illustrate a crucial distinction: apparent improvements in performance enabled by generative AI may
mask deficits in learners’ underlying cognitive and metacognitive processes. However, this does not mean AI cannot
play a productive role in learning. When structured intentionally within a collaborative learning environment, AI can
act as a powerful scaffold. For instance, An et al. studied student teachers using a mind-mapping tool
integrated with GenAI. The groups using the AI tool not only outperformed the control groups on their collaborative
tasks but also demonstrated a more sophisticated knowledge construction process, moving progressively from
individual ideas to peer interaction and group synthesis.
As students increasingly use GenAI tools for learning, traditional assessment models that focus solely on final outputs
are becoming inadequate. When high-quality products can be produced with minimal engagement in the learning
process, assessment risks measuring technological proficiency rather than human skill or understanding. To address
this challenge, there is a pressing need to reorient assessment practices towards process-oriented approaches that
evaluate not just what students produce, but how they engage with learning to create products. Assessments should
aim to capture the processes students use to plan, monitor, and adapt their work, thereby revealing the authenticity
and depth of their learning in GenAI-rich environments. Only by prioritising cognitive and metacognitive engagement
alongside product quality can educational systems ensure that AI augments, rather than supplants, the development
of meaningful human expertise.
One promising way to operationalise this shift is through evidence-centred assessment design (ECD). The ECD framework provides a principled model for linking assessment tasks, evidence, and inferences
about learners’ knowledge and skills. By moving beyond a narrow focus on final outputs, ECD enables the design of
multidimensional assessments that capture both product and process evidence.
An illustrative example of this process-oriented approach comes from recent work in medical education, where clinical
reasoning tasks have been redesigned to capture a more holistic view of learning. Drawing on
the ECD framework, this approach moves beyond assessing only the final diagnostic conclusion. Instead, it builds a
multidimensional evidence model by collecting three streams of data as students interact with GenAI-powered virtual
patients: product evidence (e.g. diagnostic accuracy), process evidence (e.g. conversation logs where students do
history taking), and metacognitive evidence (e.g. clickstream data and interaction logs). Analysis of this rich data reveals
that integrating all three evidence sources provides a significantly more reliable prediction of learner performance
than relying on product-based measures alone. Notably, process data emerged as the strongest standalone predictor,
underscoring the value of assessing the “how” of learning, not just the “what”.
Building on this ECD foundation, predictive models are now being paired with Explainable AI (XAI) to make the
assessment process not only accurate but pedagogically meaningful. Simply predicting performance with a “black-box” machine learning model is insufficient for supporting learning. To make insights actionable, the XAI layer
identifies the key factors influencing a prediction. These technical explanations are then
translated by a GenAI system into structured, personalised, and pedagogically relevant feedback for the learner.
This hybrid XAI–GenAI approach ensures that feedback is aligned with self-regulated learning principles, helping
students understand not only their performance but also the cognitive and metacognitive strategies that shaped
it. By grounding feedback in specific evidence from the learning process, this approach extends the ECD model
beyond assessment design to feedback delivery, providing transparent, actionable guidance that fosters genuine
skill development.
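The combination of product, process, and metacognitive evidence described above can be sketched as a weighted composite in which the weakest evidence stream becomes the focus of feedback. The weights below are illustrative assumptions, not the fitted parameters of the FLoRA evidence model.

```python
def combine_evidence(product, process, metacognitive,
                     weights=(0.3, 0.45, 0.25)):
    """Combine three evidence streams (each normalised to 0..1) into a
    single performance estimate. Process evidence is given the largest
    illustrative weight, echoing the finding that it was the strongest
    standalone predictor. Returns the composite score and the label of
    the weakest stream, which feedback generation would target."""
    scores = (product, process, metacognitive)
    composite = sum(w * s for w, s in zip(weights, scores))
    labels = ("product", "process", "metacognitive")
    weakest = min(zip(labels, scores), key=lambda x: x[1])[0]
    return composite, weakest

score, focus = combine_evidence(product=0.9, process=0.4, metacognitive=0.7)
# focus == "process": feedback would target how the learner worked,
# e.g. the structure of their history-taking conversation.
```

In a full pipeline, the `weakest` label would be passed, together with the underlying evidence, into a GenAI prompt that drafts the personalised feedback shown in Figure 2.5 C.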
Figure 2.5 demonstrates how this assessment approach is implemented in the FLoRA platform for history-taking skills
as part of the development of clinical reasoning in medical education. Learners first interact with
the virtual standardised patients (Figure 2.5 A), which are also based on GPT. Once the learner completes the interaction
with the virtual standardised patient and submits their diagnosis (Figure 2.5 B), the system applies the evidence model
and generates personalised feedback (Figure 2.5 C).
The methodological rigour of research on generative AI in education is critical to produce quality evidence. If we are
to make sound, evidence-informed decisions, it is essential to move beyond the commentaries and hype cycle and
uphold high standards of empirical inquiry. As also indicated in the previous subsection and Figure 2.4, a central
challenge in producing robust empirical evidence about the effects of GenAI on human skills is the pervasive conflation
of performance with learning. Performance refers to the observable execution of a task, whereas
learning involves an enduring change in knowledge and skills that is demonstrated through retention and transfer. The distinction is essential; high performance, especially when mediated by a
powerful tool, does not imply that learning has occurred.
A second, related but distinct issue is the media/methods fallacy. For decades, researchers have
cautioned against simplistic “media comparison studies” that attribute learning gains to a technology itself, rather
than to the specific instructional methods it enables. Much of the nascent research on generative AI repeats this
error, comparing an ill-defined “ChatGPT condition” with a control group and concluding that the technology “works”.
Such designs may demonstrate that a particular arrangement (e.g. students working with ChatGPT) can yield different
outcomes than another (e.g. students working alone). However, because they attribute effects to the technology as
a whole rather than to the specific instructional processes it affords, these studies provide limited insight into the
underlying mechanisms. This limits their explanatory power and risks conflating performance support with genuine
learning.
A further methodological weakness, distinct from but often co-occurring with the media/methods fallacy, is the
conflation of task performance with learning. For instance, meta-analyses claiming that ChatGPT enhances “academic
performance” often measure immediate task achievement rather than durable learning. While students may produce a better essay or translation with AI assistance, this performance
gain may mask a lack of underlying cognitive engagement and learning gains. As discussed earlier, offloading effort
to AI can reduce cognitive load but also risks fostering “metacognitive laziness”, thereby undermining the very
processes required for deep skill development. This problem
is amplified by the “fast science” culture, where sensational claims, such as GPT-4 “acing” the MIT curriculum, gain traction despite significant methodological flaws, including data contamination and a lack of
transparent verification. Even if such claims were accurate, they would have limited
educational meaning, as strong GenAI performance on standardised or benchmark tasks does not entail conceptual
understanding and does not imply that the underlying processes of human learning and transfer should be
abandoned. The danger lies in conflating technical proficiency with educational value, which can distort expectations
and policy directions and fuel the kind of policy-practice misalignment that has characterised many AI-in-education
debates.
To build a robust evidence base, the field must adopt a more rigorous research agenda. First, researchers must
explicitly differentiate learning from performance by incorporating process-oriented assessments, such as delayed
retention and knowledge transfer tests, into their designs. Second, studies must move beyond
media comparisons to isolate causal mechanisms, clearly defining the pedagogical function of the AI intervention,
much like the decades of theory-driven research on Intelligent Tutoring Systems. Finally, we must
prioritise longitudinal research that tracks the durable effects of AI interaction on students’ knowledge, skills, and
dispositions over time.
For policymakers and funding organisations, this highlights a critical need to guide future investment. To build a robust
evidence base, funding should prioritise longitudinal studies that track durable skills, demand that interventions
clearly specify their pedagogical underpinnings, and support the development of process-oriented assessments. Only
by investing in research that distinguishes task performance from learning can we ensure that technology serves our
ultimate goal: fostering deep and lasting human competence.
In conclusion, our findings underscore the significant promise of GenAI in enhancing educational practices related to
learning and assessment. Specifically, we have demonstrated that GenAI-powered systems can directly support both
students and educators, streamlining teaching activities and providing targeted assistance. However, despite these
promising developments, our analysis also reveals several critical caveats that must be carefully considered when
informing future practice and policy in this area. One such concern is the need for careful attention to the design
of teaching practices that enable the effective use of GenAI, particularly in systems designed to directly support
students, such as tutoring chatbots. For example, recent studies have shown that combining GenAI with established
instructional methods (e.g. scaffolding), where GenAI agents guide students through step-by-step reasoning, can
foster genuine learning and sustained performance enhancement even after removal of GenAI support. By contrast, unguided “answer-giving” practices, where students simply request solutions from a chatbot,
have been found to undermine reflection and suppress metacognitive engagement. As emerging evidence suggests, not all students may benefit equally from these systems.
Therefore, it is essential to consider how different student subpopulations, based on factors such as socio-economic
status or prior academic achievement, interact with these technologies. By identifying the specific conditions under
which these subgroups may benefit from GenAI-powered systems, we can mitigate potential inequalities and ensure
that AI tools support a diverse range of learners.
Moreover, while GenAI offers the potential for rapid development of tools like tutoring chatbots, one should recognise
that on average general-purpose LLMs (e.g. off-the-shelf GPT systems) do not yet match the effectiveness of traditional
intelligent tutoring systems when they are not designed or fine-tuned with adequate pedagogical knowledge. Hybrid systems that embed GenAI within educationally grounded frameworks may show more
promise, but the evidence base remains limited. We also still lack evidence on how GenAI-powered
tutors compare with their conventional counterparts in their ability to provide sustained, long-term learning
support. As a result, future research needs to investigate whether GenAI-based tutoring systems can effectively
support learners over extended periods. In addition, there is considerable potential for GenAI to complement existing
intelligent tutoring systems by enhancing their interactivity and enabling more natural language communication,
which could ultimately create more personalised learning experiences. Future research should focus on exploring
how GenAI can be integrated into intelligent tutoring systems, drawing on well-established educational principles to
enhance these systems’ overall efficacy.
Despite the promising capabilities of GenAI to generate high-quality feedback, research has shown that students’
trust in AI-generated feedback varies considerably across contexts. In some studies, students respond positively and
perceive such feedback as clear and useful, while in others they express scepticism about its accuracy or relevance.
This variability in trust can influence whether learners engage with GenAI feedback, which in turn affects its potential
impact on learning. To fully realise the benefits of GenAI in supporting feedback processes, future work should focus
on developing teaching practices that help integrate AI-generated feedback effectively into classroom use. One
promising direction is to use GenAI as a tool to help educators reflect on and refine their feedback (i.e. human or
AI-generated) by checking whether it is clear, balanced, and aligned with established feedback principles before it
reaches students.
GenAI shows promise in supporting educators with their daily teaching and administrative tasks. Although existing
evidence grounded in more reliable measures of time spent with technology, such as usage log analysis, shows
increases in efficiency for some tasks like lesson planning, qualitative studies highlight potential “blind spots” in
these estimates that warrant further research. Specifically, the hidden labour educators must invest in reviewing and
verifying the accuracy of AI-generated content may not be fully accounted for when relying solely on usage logs,
unless those revisions are made within the logged platform itself. All this emphasises the need for further research into how GenAI tools can
be designed to enhance, rather than complicate, teaching practices.
In the realm of assessment, GenAI offers valuable opportunities to streamline the creation of assessments and
automate scoring processes. Its potential has been demonstrated in large-scale standardised tests, such as the
Duolingo English Test, where GenAI assists in generating items that meet psychometric standards. However, this level of psychometric rigour is rarely required or feasible in everyday classroom assessments. While widely
available LLMs can help teachers design questions or tasks more efficiently, their outputs still require human review
to ensure pedagogical relevance, truthfulness, fairness, and alignment with learning goals. Likewise, prompt-based
approaches to automated scoring, although more accessible to non-technical users, remain less reliable than fine-tuned or conventional machine learning models. Educators should therefore treat GenAI tools as a complementary
aid rather than a substitute for human judgement, validating their outputs for clarity and appropriateness before
classroom use. Future research should focus on developing practical frameworks that help teachers integrate GenAI tools into formative and summative assessment processes responsibly, combining the efficiency of automation with
the interpretive expertise of educators. To assist educators, future work needs to focus on developing classroom
assessment strategies that incorporate GenAI in meaningful ways, expanding its applicability while also ensuring that
it enhances the overall learning and teaching experience for both educators and students.
Finally, our study highlights a critical risk: the uncritical adoption of GenAI may inadvertently undermine the
development of key human skills such as critical thinking, metacognition, and evaluative judgment, all of which are
foundational to genuine expertise. This could result in what we describe as the “mirage of false mastery,” where the
impressive outputs generated by AI mask the underdevelopment of essential skills, including hybrid human-AI skills (Box 2.3). The path forward, therefore, is not a rejection of technology but a commitment to pedagogical intentionality
and methodological rigour. Rather than simply asking whether GenAI “augments students’ task performance”, we
must focus on how it can be used to foster deep, meaningful, and durable learning. This means reorienting our
focus from GenAI-driven products to human-centred processes, ensuring that GenAI tools are designed to scaffold
rather than supplant human thinking. By prioritising the development of durable, transferable skills and integrating
metacognitive awareness into both learning and assessment, we can unlock the transformative potential of GenAI,
creating an educational future that is not only more efficient but also authentically human.