The use of GenAI in scientific research.




GenAI is used in research for four distinct but related purposes: manipulating language (including scientific language); managing knowledge; generating knowledge; and managing the entire chain of operations in a research project. Accordingly, GenAI models as used in research can be divided into four categories:

1) general-purpose models, notably large language models (LLMs) such as GPT and Gemini, used by many researchers to generate text, images or computer code;
2) specialised models dedicated to language-related tasks such as literature reviews, refereeing, and generating hypotheses or suggestions for experiments (“ideation”);
3) specialised models used to tackle highly complex scientific problems that involve vast quantities of data or complicated mechanisms (for example, the 3D shape of proteins);
4) research assistants and “robot labs”, which autonomously manage entire sequences of operations in a research project, from initial data analysis to experimentation.

We will examine these four categories of models in turn, after presenting the evidence concerning the diffusion of GenAI among researchers. A clear trend of increasing cognitive power and agency of GenAI over time will then appear.




The use of GenAI in research is advancing rapidly, though statistics remain limited. Broader trends in AI adoption can help indicate future GenAI patterns. AI adoption has been progressing fast. Duede et al. (2024) trace the share of AI-engaged publications (1985-2022), which rose from ~2% in 2015 to 8% in 2022 across all scientific fields (Figure 13.1). Evans et al. (2024), analysing 100 million papers (1980-2024), identify over 1 million AI-assisted papers (1.57% overall), showing pervasive adoption across biology, medicine, chemistry, physics, materials science and geology.







For GenAI specifically, Liang et al. (2024) studied nearly 1 million papers (2020-2024) and found steady growth in “LLM-modified” papers, with the sharpest rise after ChatGPT’s release. Uptake is strongest in Computer Science (17.5%) and weakest in Mathematics (6.3%) (Figure 13.2). The analysis does not cover the social sciences, one of the main contributors to education research, although the natural sciences account for a significant minority of education research output (Vincent-Lancrin and Jacotin, 2023).





A survey of 5 000 researchers conducted in March 2025 estimates that over half of researchers already use AI for manuscript preparation and error detection (Naddaf, 2025). About one-third use or plan to use GenAI for data collection and processing, while its use for more complex tasks (journal choice, citations) remains less common. More than half of respondents believe that AI outperforms humans in tasks like literature review, summarising, plagiarism checks and citation management, and anticipate mainstream adoption within two years (Figure 13.3). Early-career researchers show higher enthusiasm than senior colleagues, although many remain cautious about AI’s role in higher-level tasks. Despite differences across fields, one can assume that uptake for these mainly language-related tasks is similar among education researchers.





A first category of uses relates to language: translating, editing, writing and summarising papers. These tasks often involve general-purpose LLMs (ChatGPT, Claude, Gemini, etc.), which are the most accessible, but not exclusively, since some science-specialised tools also have such capabilities. AI can help adapt papers to meet journal submission guidelines, draft abstracts, write peer reviews, and assist in drafting grant proposals (Heidt, 2025). Although researchers were already using some AI writing assistants, the release of LLMs brought a substantial change in the extent of such use (Lenharo, 2024). The AI model generates text in response to a query (a “prompt”) given by the researcher. Here is an example of a prompt: “I’m writing a paper on [topic] for a leading [discipline] academic journal. What I tried to say in the following section is [specific point]. Please rephrase it for clarity, coherence, and conciseness, ensuring each paragraph flows into the next. Remove jargon. Use a professional tone.” (Gruda, 2024). Machine-assisted editing is especially useful for non-native English speakers, given its potential to improve flow, grammar and tone. In a poll by the European Research Council (ERC), 75% of more than 1 000 ERC grant recipients felt that generative AI would reduce language barriers in research by 2030 (Prillaman, 2024). A Nature poll (Kwon, 2025) surveyed more than 5 000 researchers worldwide (with China underrepresented) in March 2025. More than 90% of respondents said they believe it is acceptable to use generative AI to edit or translate one’s research paper. When it comes to generating text with AI – for instance, writing all or part of a paper – a majority (65%) think it is ethically acceptable, but about one-third are against it. The most popular use was editing a research paper, yet only around 28% said they had done this. That number dropped to about 8% for writing a first draft, summarising other articles for use in one’s own paper, translating a paper, and supporting peer review. While 42% of PhD students report using AI for editing purposes, the share drops to 22% for senior researchers.
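Prompts like the one quoted above are easy to parameterise and reuse. A minimal sketch in Python; the function name and parameters are illustrative and not part of any tool’s API:

```python
def build_editing_prompt(topic, discipline, point, section_text):
    """Fill a rephrasing-prompt template like the one quoted above.
    All names here are illustrative, not an actual tool's interface."""
    return (
        f"I'm writing a paper on {topic} for a leading {discipline} "
        f"academic journal. What I tried to say in the following section "
        f"is {point}. Please rephrase it for clarity, coherence, and "
        "conciseness, ensuring each paragraph flows into the next. "
        f"Remove jargon. Use a professional tone.\n\n{section_text}"
    )

# Hypothetical usage: the filled prompt would be pasted into an LLM chat
prompt = build_editing_prompt(
    "teacher feedback", "education",
    "timely feedback improves learning",
    "Feedback timing matters because students forget their reasoning.")
```

The value of such a template is consistency: the researcher varies only the bracketed slots while the editing instructions stay fixed.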

Respondents were asked the following question: “Which, if any, of these represent use cases or solutions that are similar to anything you are already doing and/or have already tried with AI in the past?”




Work by Kobak et al. (2025) finds that one in seven biomedical abstracts published in 2024 (among 1.5 million papers indexed in PubMed) was written with AI assistance. They detect such abstracts by identifying “excess words,” i.e. words whose frequency has surged since the rise of LLMs but that have no functional role (e.g. “delve,” “unparalleled”; in total, there are 454 excess words).
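The “excess words” idea can be illustrated with a toy frequency comparison between a pre-LLM and a post-LLM corpus. This is a simplified sketch, not Kobak et al.’s actual method, and the thresholds are arbitrary:

```python
from collections import Counter

def excess_words(pre_corpus, post_corpus, min_ratio=3.0, min_count=5):
    """Flag words whose relative frequency surged between two corpora,
    a toy version of the 'excess words' detection idea."""
    pre = Counter(w.lower() for text in pre_corpus for w in text.split())
    post = Counter(w.lower() for text in post_corpus for w in text.split())
    n_pre = sum(pre.values()) or 1
    n_post = sum(post.values()) or 1
    flagged = {}
    for word, count in post.items():
        if count < min_count:
            continue
        rate_post = count / n_post
        # add-one smoothing so previously unseen words do not divide by zero
        rate_pre = (pre.get(word, 0) + 1) / n_pre
        ratio = rate_post / rate_pre
        if ratio >= min_ratio:
            flagged[word] = round(ratio, 2)
    return flagged
```

In the real study, the comparison runs over millions of abstracts and excludes words with a functional role; here the corpora are just lists of strings.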
Support from AI in writing can boost researchers’ productivity for certain non-core tasks, like polishing style or handling administrative duties, freeing up time for more conceptual work (Gruda, 2024).
Specialised models (e.g. the Black Spatula Project and YesNoError) are used to spot errors in research papers, including factual mistakes, calculation errors, methodological flaws and referencing issues.
The systems first extract information, including tables and images, from the papers. They then craft a prompt that tells a “reasoning” model – a specialised type of LLM – what it is looking at and what kinds of errors to hunt for. The model might analyse a paper multiple times, either scanning for different types of errors each time or cross-checking results. However, the rate of false positives – instances in which the AI claims an error where there is none – is a major hurdle (10% on average according to some tests; for example, the model may state that a figure referred to in the text does not appear in the paper when it actually does) (Gibney, 2025).
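The extract-prompt-iterate loop described above can be sketched as follows. `ask_model` is a hypothetical placeholder for a call to a reasoning model, not a real API:

```python
# One pass per error type, as in the multi-pass strategy described above.
ERROR_TYPES = ["factual mistakes", "calculation errors",
               "methodological flaws", "referencing issues"]

def build_error_prompt(paper_text, error_type):
    """Craft a prompt telling the model what to hunt for in the paper."""
    return (f"You are reviewing a scientific paper. Hunt specifically for "
            f"{error_type}. List each suspected error with its location.\n\n"
            f"{paper_text}")

def multi_pass_review(paper_text, ask_model):
    """Run one focused pass per error type; in practice the findings from
    different passes would then be cross-checked to cut false positives."""
    findings = {}
    for error_type in ERROR_TYPES:
        findings[error_type] = ask_model(build_error_prompt(paper_text,
                                                            error_type))
    return findings
```

Splitting the review into typed passes mirrors the idea that a model scanning for one class of error at a time is less likely to miss it than a single catch-all query.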




As science becomes ever more quantitative, programming and analysing data are the main tasks of many researchers, especially PhD students, across all disciplines (including, increasingly, the humanities). This is certainly the case in education research, where the share of quantitative research has increased over the past decades, even though it remains a minority of education research (Vincent-Lancrin and Jacotin, 2023). These tasks require specific skills in complex techniques and can consume a lot of time (e.g. for “debugging”, i.e. tracking down mistakes in computer code), while exposing the researcher to significant risks of error. Special tools, based notably on GenAI, have been developed to alleviate these burdens. Code editors aim to make it easier for researchers to use coding to organise data, create analytical sequences, and generate descriptive statistics or visualisations. Such tools are now widespread, having overtaken GitHub and Stack Overflow (a community question-and-answer site) for troubleshooting. They allow researchers to save a lot of time, generate higher-quality outputs and allocate time to more substantive matters. Rather than spending hours waiting for answers from correspondents, users can simply highlight a section of code and ask a GenAI chatbot to fix it (Heidt, 2025). There are also more sophisticated AI models that can extensively analyse large tables of numbers and generate outputs like predictions (imputing), error detection, etc., removing the need for the researcher to do the programming themselves. For instance, TabPFN is a “tabular machine learning” model that infers outcomes from tables of any sort of data. It can take a user’s dataset and immediately make inferences about new data points (McElfresh, 2025).
GenAI models also have a strong capability for processing “unstructured data”, like texts and images, which can thereby be quantified and subjected to powerful statistical analysis: this is of particular interest for the humanities and education research.
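As a stand-in for the kind of inference that tabular models such as TabPFN automate, here is a toy nearest-neighbour imputation in plain Python (this is not TabPFN’s API, just an illustration of filling missing values from a table):

```python
def impute_missing(rows, target_col):
    """Fill missing values in `target_col` using the nearest complete row
    (squared Euclidean distance on the other columns). A toy stand-in for
    what tabular machine-learning models do at much larger scale."""
    complete = [r for r in rows if r[target_col] is not None]
    for row in rows:
        if row[target_col] is None:
            def dist(other):
                # compare on every column except the one being imputed
                return sum((row[c] - other[c]) ** 2
                           for c in row if c != target_col)
            nearest = min(complete, key=dist)
            row[target_col] = nearest[target_col]
    return rows
```

A model like TabPFN replaces this hand-written rule with a learned prior over tables, but the researcher-facing workflow is the same: supply a table with gaps, get predictions back without writing the modelling code.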




While LLMs process words, models based on the same techniques (notably the so-called “transformer” architecture) can be trained on other types of data: chemical formulae, mathematical concepts, astronomical pictures, DNA, etc. These models can also mix different types of data as inputs (“multi-modal”), or take one type of data as input and generate a different type as output. This diversification in data types allows GenAI models to be applied to a broad variety of problems, in a broad variety of disciplines and contexts. These models are commonly applied to so-called “closed-world problems”, where the fundamental laws are known but drawing out predictions is computationally difficult, because the parameters and variables are too numerous or the relations are complex and non-linear. Examples abound in biochemistry, materials science and weather forecasting. This allows fundamental, established knowledge to be combined with an algorithm’s superior capacity to find meaningful correlations in data. These models are statistical in nature and are thus trained on vast amounts of data, which restricts the cases where they may apply (not all domains offer sufficient amounts of data). They allow researchers to save time and reduce the cost of research. “Our goal,” says a biologist, “is to create computational tools so that cell biology goes from being 90% experimental and 10% computational to the other way around.” This comment was made about a project using AI to create a “virtual cell” (Callaway, 2025). While education systems generate large amounts of data, privacy and ethical concerns have made their widespread use for education research complex. Still, some of these techniques could increasingly apply to the analysis of large datasets and are already in use to some extent.
For example, Pardos and Borchers (2026) use these AI analysis and visualisation tools to show the similarity between higher education courses based on students’ enrolment history. Moreover, education research increasingly builds on neuroscience and cognitive science, and one can imagine that learning science will also benefit from advances in the study of the brain from a chemical or biological perspective. For example, advanced AI techniques may help to better understand the clinical and socio-genetic dimensions of learning and education performance (Isungset et al., 2022; Morris et al., 2022).





Chemistry and biology are leading disciplines where GenAI has been applied, notably due to the availability of large amounts of data and the combinatorial nature of the mechanisms at work. The task of most models is to relate some property (therapeutic or physical) to the composition of a compound. Hence, a model can either predict the properties of given compounds or, alternatively, predict the composition of compounds that display given properties. Some models can also do retrosynthesis, i.e. predict the sequence or network of chemical reactions that allows a particular compound to be produced from given ingredients (reactants). These properties and reactions obey known physical and biological laws, but the number of components, the nonlinearity of many mechanisms involved and the sensitivity of aggregate properties to minor modifications make it difficult, or even impossible, to analytically solve most cases. In biology, the compounds are extremely complex (proteins can be made of several thousand molecules). Most models mix data analysis with knowledge of the basic rules of the domain, so that the generated items comply with its known laws. There is a strong analogy between chemistry or biology on the one hand and language on the other, as both are compositional: they are made of elementary components (words or molecules) that combine to produce emergent properties (meaning or physical characteristics). Hence, the techniques used to train LLMs have been directly transferred to these domains. Certain researchers even use LLMs directly to conduct chemical analysis, although LLMs’ training base in the domain is not as large as that of specialised models.
They note: “Our results show that LLMs can accurately reason about chemical entities in both local and global terms, analysing single reactions but also whole synthetic routes, and that such capabilities can be exploited through search algorithms for solving chemical problems in more flexible terms.” (Bran, 2025). Some models are made of several interconnected modules handling different types of data, and leverage the synergies between these data types. Certain models relate natural language to chemical or biological data, which lets users ask the model, in natural language, which chemical composition would display a particular property.

Box 13.1 provides examples of specialised scientific AI models. 









LLMs can be used to simulate human participants in empirical studies, for example producing synthetic interviews, interactions between actors, or specific behaviour in particular situations. There is active research on simulating human behaviour, with models specially trained on psychological material (e.g. Binz, 2025) and integrating knowledge from the cognitive sciences. AI models trained on human behavioural data can serve as test benches for simulating human decisions in various contexts, including educational ones. They could play a role similar to that of organoids (self-assembled constructs that mimic certain properties of in vivo organs) in medical research. Such models accelerate studies and reduce their cost. There are still limitations to this approach, though, as the capacity of LLMs to simulate the diversity of human behaviours in complex situations remains limited. For instance, in a study about factory working conditions, a worker on the floor and a manager would likely give different responses about various aspects of the work and workplace. However, an LLM participant’s generated responses might combine these two perspectives into one answer, conflating attitudes in ways that are not reflective of reality (Kapania et al., 2025). Should this use of AI become fruitful, it could have a strong impact on education research, notably for the generation of survey answers, which, according to survey implementers, have become increasingly difficult to collect. For example, in the production of standardised assessments, Liu (2025) shows that multi-agent AI models bringing together ensembles of LLMs can serve as “synthetic respondents”, producing response distributions with psychometric properties closely aligned with those of college students. Pardos and Borchers (2026) argue that LLM-based calibration can complement limited student response data, reducing costs and accelerating item validation cycles.
While human responses remain essential (not least because they are used to generate simulated ones), AI-generated responses could augment them and, as in the case of answers to test items, expand their variance while remaining aligned with them. While it will take time to assess when simulated answers add value without distorting human responses, this line of AI impact would be particularly useful for education research if it proves successful.
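The idea of synthetic respondents producing a response distribution can be sketched as follows. In a real system, the probability profile for each test item would come from an ensemble of LLMs rather than being assumed up front, as it is here:

```python
import random
from collections import Counter

def synthetic_respondents(option_probs, n=1000, seed=0):
    """Draw n simulated answers to a multiple-choice item from an assumed
    probability profile and return the resulting response distribution.
    In practice the profile would be produced by an LLM ensemble."""
    rng = random.Random(seed)
    options = list(option_probs)
    weights = [option_probs[o] for o in options]
    draws = rng.choices(options, weights=weights, k=n)
    counts = Counter(draws)
    return {o: counts.get(o, 0) / n for o in options}

# Assumed item profile: option A is the popular (correct) answer
dist = synthetic_respondents({"A": 0.6, "B": 0.25, "C": 0.1, "D": 0.05})
```

Psychometric validation would then compare such simulated distributions (item difficulty, discrimination) against distributions from real student samples.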


“We are dwarfs standing on the shoulders of giants,” Bernard de Chartres famously said, characterising the cumulative dynamics of knowledge: new discoveries are primarily elaborations and combinations of past ones. Access to and mastery of existing knowledge are key for researchers to build on that knowledge and make new discoveries. Hypothesis generation is key to the research process and to making new discoveries; it is closely tied to the existing knowledge on which it relies, but it also involves distinct mechanisms that will be examined in the next section. With the mounting number of scientific publications (articles, databanks, images, computer programmes, etc.), it has become increasingly difficult for researchers to keep pace with advances in their own field, despite increasing specialisation within scientific domains. Hence a new challenge for researchers: building on knowledge with which one lacks familiarity. AI has given rise to tools that can support researchers in these tasks, such as generalist LLMs or specialised models (e.g. Elicit, Consensus, Clarivate, PaperQA2, BioloGPT) (You, 2024). These tools can conduct knowledge management operations like literature searches, summaries and literature reviews, which we will examine below. These models also work for education research, although they are not fine-tuned for this domain.




The researcher enters a particular research question into the model (for example, “is virus X responsible for disease Y?”), and the model responds with a list of publications relating to the query and, for each publication, a summary of its results relating to the question. Some models offer a synthetic (consensus) view of the literature, with lists of publications agreeing or disagreeing with the consensus and the corresponding arguments. Some tools can generate a graphical map of the research landscape concerned, with citation-based relations between publications (who cites whom, who is co-cited with whom, etc.) (Kudiabor, 2024). Compared to LLMs, specialised models aim to offer superior reliability, as they use only scientific publications, avoiding blogs and other sources of lesser scientific reputation. Certain tools also offer products beyond the aforementioned search results, like literature graphs (extracting the main concepts or results of a domain and relating them to each other in a knowledge graph). Certain platforms have a “Chat with PDF” function, which allows the user to upload a paper and ask questions about its content (Heidt, 2025).
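The citation-based relations these tools map reduce to simple counting over reference lists. A minimal sketch of co-citation counting (two papers are “co-cited” when a third paper cites both):

```python
from collections import Counter
from itertools import combinations

def co_citations(papers):
    """Count how often pairs of papers are cited together.
    `papers` maps a paper id to the list of ids it cites; pairs are
    returned as sorted tuples so (a, b) and (b, a) count together."""
    pairs = Counter()
    for refs in papers.values():
        for a, b in combinations(sorted(set(refs)), 2):
            pairs[(a, b)] += 1
    return pairs
```

Literature-mapping tools build their graphs from counts like these, then lay out the papers so that frequently co-cited works cluster together.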



AI models can produce summaries of publications on request, which gives researchers a rapid overview of a set of papers of interest, saving reading time and allowing them to focus on the most relevant papers. However, the quality of such summaries can at times be low. Peters and Chin-Yee (2025) compared AI summaries with the human-written summaries that some journals provide, across 4 900 examples in medicine and science. They found that all AI models tend to overgeneralise the results presented in the papers, often omitting important details that restrict the domain of validity of the results and leaving out relevant nuances. They might, for example, simply state that a drug is effective for treating a certain condition, without specifying in which dose or for which group of patients. This reflects the difficulty AI models have in fully recognising the importance of “details” that intelligent human readers find significant. The same issue applies to education research, where results may be more or less relevant depending on country, socio-economic background, sex, etc.



GenAI models can provide structured summaries of the literature relating to particular questions (Skarlinski et al., 2024). These reviews are useful to researchers for getting a broad view of a question while saving time and making sure that they do not miss the most important relevant publications.

Some models offer “systematic” reviews, which include granular information on the methods and results of each paper in a standardised manner; this is necessary if a researcher wishes to reproduce an experiment or perform a meta-analysis. Some researchers are sceptical about the quality of AI-generated systematic reviews (Pearson, 2024), as AI models tend to skip specific but important information, like the precise dose of a drug, as mentioned above. The same would hold for the precise pedagogical context of, say, studies on the impact of project-based learning, lecturing, or the use of technology in education. More generally, literature reviews conducted with AI have certain limitations. First, many models can access only the abstracts of all publications and the full text of open-access ones. Access to a significant part of the scientific literature is restricted (although publishers’ tools can, of course, access their own publications), so that many important research findings, and most notably methods, are often skipped in AI-generated reviews. Second, GenAI literature review tools sometimes struggle to identify the most relevant papers in a field and to distinguish topical from outdated literature, and may first list ideas that used to dominate a domain but have since become outdated. Such systems could still be used to update human-authored literature reviews rather than generate new ones: humans cannot feasibly update reviews very frequently, and AI could provide for this, even though authoritative reviews may still need human involvement.





The most recent models, like OpenAI Deep Research or Gemini Deep Research, can provide “research reports”, which go beyond a literature review as they provide broader background and context and identify pending questions (Heidt, 2025). The user can enter a query, together with their own data (articles, etc.), and the model returns a full report, including text, figures and corresponding bibliographic references. These models mimic how a person would approach a research question. This is especially interesting when exploring a domain with which the user is not familiar: it helps access general knowledge in clear language (Jones, 2025). One specialised model, PaperQA2, writes Wikipedia-style summaries of scientific topics that are cited and significantly more accurate than existing, human-written Wikipedia articles. It can identify contradictions within the scientific literature, a task that is challenging for humans (Skarlinski et al., 2024). Certain models generate draft research reports in which they identify gaps in knowledge, coming close to suggesting possible further research topics. The quality of these reports is debated (Jones, 2025): they often include incorrect (or invented) citations, fail to distinguish authoritative information from mere suppositions, and do not convey uncertainty accurately.



Hypothesis generation is a defining activity for a researcher. It consists of generating ideas from the literature or from data that are at once novel, plausible and testable. Whereas a literature review is about what is known, hypothesis generation is about jumping into the unknown: identifying possible responses to questions that the literature does not answer, while remaining consistent with established knowledge. Until recently, this was a preserve of humans. Now AI can do it too. It usually involves three steps: generation of the hypothesis; evaluation/validation (or rejection); and improvement/refinement. In education research, these techniques could help explain learning trajectories or some puzzling aspects of education outcomes. AI techniques could combine multiple and remote sources of information to generate original hypotheses. For example, one could imagine AI systems exploiting the large size of international or national datasets to generate hypotheses on the factors explaining increases or decreases in student outcomes. AI could also connect these results to other possible sources and point education researchers to explanations that are not immediately visible in their data (e.g. learning outcomes might improve thanks to the recent availability of social services that reduce student absenteeism or improve mental health as parents get better support).



Most models draw hypotheses from the literature, but some can also do so directly from the data they are given to analyse. Compared to humans, AI models have the advantage of broader knowledge of the literature: not only in the discipline concerned but possibly in others, accessing more diverse sources (assuming the model has been trained on or can access this knowledge, which may be problematic, as illustrated by the limits of automatic literature reviews). However, AI models face specific difficulties in automatically drawing hypotheses from the literature: 1) the source texts might not make clear what the problems and corresponding hypotheses are; 2) the link between a problem and a hypothesis as stated in the literature might not be straightforward; 3) the novelty or feasibility of a hypothesis can be difficult to evaluate, yet they need to be measured and ranked; 4) initially designed hypotheses usually need to be improved to strengthen their novelty or feasibility, requiring further operations after extraction from the literature. One simple way for a researcher to interact with a model is through brainstorming, with a prompt like “give me ten ideas of mechanisms that could explain how A influences B”. The researcher can also challenge the model by submitting their own hypothesis and asking the LLM for counterarguments or alternative hypotheses. This simple procedure allows for initial suggestions, but working out structured and plausible hypotheses requires a more articulated approach, using specialised models. One approach is to insert highly structured data into a prompt so as to strictly constrain the model’s response. For instance, a researcher investigating how microplastics are transported through soil and into groundwater could use a visualisation tool, Research Rabbit.
The tool takes a single “seed paper” and generates an interconnected web of research linked by topic, author, methodology or other key features. By inserting its results into an LLM, “it’s possible to query the body of work for hidden links or new ideas” (Heidt, 2025). Regarding the difficulty LLMs have in linking problems and hypotheses from reading a corpus, one solution is to fine-tune an existing LLM so that it can better identify problems and hypotheses in papers. O’Neil et al. (2025) assembled a database of 5 500 scientific hypotheses (HypoGen), with which they trained an existing LLM. These data are structured in a way that makes clear what the problem is, what the hypothesis is, and what the chain of reasoning from the problem to the hypothesis is. Extracting hypotheses from data with AI models is made difficult by the lack of explainability of AI. For instance, a correlation between certain events might be difficult to attribute to particular features of those events. The model will observe that phenomenon A is linked with B, but it cannot say whether this is due to feature C or D of these phenomena. The model can see patterns in the data that are not visible to humans, and it might be difficult for the model to translate them into words, as hypotheses that humans can understand. Ludwig and Mullainathan (2023) propose a procedure that enables correlations found by AI in the data to be expressed in words, so that they can be explicated to humans and tested. For this purpose, they use counterfactuals: they generate synthetic data that exaggerate the correlations found in the initial data, to the point where the pattern concerned becomes visible to humans and can be interpreted.
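A HypoGen-style record, with an explicit problem, hypothesis and reasoning chain, plus the kind of novelty/feasibility ranking mentioned above, might be sketched as follows. The fields and scoring are illustrative, not the dataset’s actual schema:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    """Structured record in the spirit of HypoGen: the problem, the
    hypothesis, and the reasoning chain linking them, plus illustrative
    novelty/feasibility scores in [0, 1] (not part of the real schema)."""
    problem: str
    statement: str
    reasoning: list
    novelty: float = 0.0
    feasibility: float = 0.0

def rank_hypotheses(hyps, w_novelty=0.5):
    """Rank candidates by a weighted mix of novelty and feasibility,
    a toy version of the 'measure and rank' step described above."""
    def score(h):
        return w_novelty * h.novelty + (1 - w_novelty) * h.feasibility
    return sorted(hyps, key=score, reverse=True)
```

Making the problem-to-hypothesis chain an explicit field is what lets a fine-tuned model learn the link that plain papers often leave implicit.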





Inferring hypotheses from the literature might not be enough, or fully satisfactory, as “raw” hypotheses might be insufficiently novel or articulate, too similar to their sources, weakly plausible (not fully coherent with the evidence), or difficult to test experimentally. Thus, a process of refinement of the ideas extracted from the literature is warranted. This is one of the most difficult challenges for AI in science, as it requires both imagination and reasoning capacities: the capacity to improve an idea while keeping its core; to infer logically; to assess the proximity of an idea to the “real world”, etc. A lot of development is occurring in this domain. The main techniques include: multi-step reasoning (“chain of thought”, requesting the machine to make explicit the steps it follows in its reasoning); reinforcement learning (training the model so that it strengthens its successful features and weakens others); evolutionary computation; and multi-agent systems (see below). Models developed since 2024 include one or more of these techniques. Evolutionary computation is a technique inspired by mutation and natural selection in Darwinian evolution. It begins with a review of the literature, from which it extracts an initial list of hypotheses. It applies small, random changes to an algorithm and selects the ones that improve the model’s efficiency. To do so, the model conducts its own “experiments” by running the algorithms and measuring how well they perform. Afterward, the model produces and evaluates a paper. After “augmenting the literature” this way, the algorithm can start the cycle again, now building on its own results (Castelvecchi, 2024). Agentic AI is also being applied to science. An agent is an autonomous system that can pilot various tools towards a given objective. Multi-agent systems comprise several agents with specific objectives and specialised skills; each one pilots an AI model (e.g. an LLM) and interacts closely with the others under the supervision of a “lead agent”, which acts like the conductor of an orchestra. Some models also integrate reasoning capacities like the aforementioned “chain of thought”. A multi-agent model aims to function like a group of researchers: some agents are specialised in a particular discipline; some play a particular role, making proposals, challenging others’ proposals or combining them; at each step they are assigned specific tasks by the lead agent and work with their respective tools to implement them; they meet and confront their respective findings, with open discussions whose conclusions are included in a report shared with human researchers. The whole process starts with a prompt that includes a description of the problem and contextual information, which is submitted to the model. The lead agent and the other agents then design a research plan, possibly with a sequence of sub-questions, a set of parallel tasks, a list of skills required from agents, etc. Then an iterative process can take place, in which each agent accomplishes the tasks it has been charged with and reports to the lead and to other agents; the lead synthesises the findings at each stage and monitors the advancement of the whole process. At the end, the model can draft a research report (Biever, 2025). In education, multi-agent models with GenAI agents are, for example, used to develop assessment items. They may hold promise for educational research, which is often multi- or interdisciplinary or addresses broad socio-technical issues (such as the adoption and use of AI in education).
Education researchers may use such models to ensure different types of expertise, information sources and constraints are brought together, for example to generate new ideas on children’s school rhythm, which involves expertise in children’s biological and psychological development and needs, learning science and pedagogy, parental work schedules, etc. (Figure 13.4). One can also imagine some fruitful uses to generate ideas or improve usual hypotheses for education policy research, with some of the AI agents playing the role of different education stakeholders and providing ideas on addressing some education policy issues such as the provision of equal opportunities. The multi-agent models could then make new suggestions of educational interventions or of policies in this simulated environment.
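The mutate-evaluate-select cycle of evolutionary computation described above can be illustrated with a minimal sketch. Here the "algorithm" being refined is reduced to a single numeric parameter and the fitness function is a toy stand-in for the model running its own experiments; both are illustrative assumptions, not part of any real system.

```python
import random

def evolve(fitness, seed_candidates, generations=50, population=20, mutation_scale=0.2):
    """Toy evolutionary loop: mutate candidates, keep the fittest.

    `fitness` scores a candidate (higher is better), standing in for the
    model "running" an algorithm and measuring how well it performs.
    """
    pool = list(seed_candidates)
    for _ in range(generations):
        # Apply small random changes ("mutations") to existing candidates.
        offspring = [c + random.gauss(0, mutation_scale) for c in pool]
        # Select the candidates that improve performance.
        pool = sorted(pool + offspring, key=fitness, reverse=True)[:population]
    return pool[0]

# Hypothetical example: the "literature" suggests two starting guesses for a
# parameter whose true optimum is 2.0; the loop refines them by selection.
random.seed(0)
best = evolve(lambda x: -(x - 2.0) ** 2, seed_candidates=[0.0, 5.0])
```

A real system would, of course, mutate program code or full hypotheses rather than a single number, and would evaluate fitness by running actual experiments.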
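The division of labour in a multi-agent model — a lead agent assigning tasks to specialised agents and synthesising their reports — can be sketched as follows. The agent names, specialties and the placeholder `work_on` method are hypothetical; in a real system each agent would pilot an LLM with its own tools.

```python
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    specialty: str

    def work_on(self, task: str) -> str:
        # In a real system this call would pilot an LLM with specialised
        # tools; here it just returns a placeholder finding.
        return f"{self.name} ({self.specialty}): findings on '{task}'"

def lead_agent(problem: str, agents: list[Agent], rounds: int = 2) -> str:
    """The lead decomposes the problem, assigns tasks to each agent,
    and synthesises their reports at every round."""
    report_lines = [f"Problem: {problem}"]
    for round_no in range(1, rounds + 1):
        # Assign each agent a task angle matching its specialty.
        tasks = [f"{problem} (round {round_no}, {a.specialty} angle)" for a in agents]
        reports = [a.work_on(t) for a, t in zip(agents, tasks)]
        report_lines.append(f"Round {round_no} synthesis: {len(reports)} reports merged")
        report_lines.extend(reports)
    return "\n".join(report_lines)

# Hypothetical team for the school-rhythms example discussed above.
team = [Agent("A1", "child development"), Agent("A2", "pedagogy"),
        Agent("A3", "family schedules")]
summary = lead_agent("rethinking school rhythms", team)
```

The final string stands in for the research report the lead would share with human researchers after the iterative rounds conclude.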





Box 13.2 presents some examples of specialised scientific multi-agent GenAI models.








How effective are GenAI models in generating, refining and evaluating scientific hypotheses? 

The evidence is still scarce, as testing such models is complex and costly. The most effective tests are carried out by performing actual research and examining the model's achievements. Assessments vary widely, reflecting the diversity of models, research questions and testing methods. On broad research questions, the models seem able to provide useful suggestions that point to potentially fruitful research directions, thanks to their very good access to the literature and their ability to process it in a highly structured way. According to Anthropic (2025), their "internal evaluations show that multi-agent research systems excel especially for breadth-first queries that involve pursuing multiple independent directions simultaneously."

When it comes to more specific research questions, the evidence is mixed. There have been impressive achievements, with models able to identify and describe hypotheses that were then successfully tested by researchers (see Box 13.3), but also less successful cases, and the successful cases involve significant human participation in the process. Certain studies also point to the tendency of some models to propose solutions that are plausible but not really novel, including some that had already been explored and abandoned in the past. Wang et al. (2024) conducted extensive evaluation experiments using human annotators with domain expertise to assess the proposals of a multi-agent model called Scimon (Box 13.2). They found that the generated "ideas still fall far behind scientific papers in terms of novelty, depth and utility – raising fundamental challenges toward building models that generate scientific ideas."

Regarding the AI Scientist (Box 13.2), specialised in machine-learning research: "the authors admit that the papers that AI Scientist has produced contained only incremental developments. Some other people were scathing in their comments on social media. 'As an editor of a journal, I would likely desk-reject them. As a reviewer, I would reject them,' said one commenter on the online forum Hacker News." (Castelvecchi, 2024)

In mathematics, the evidence is also contradictory. Some tests indicate that certain claimed achievements of models in solving problems at Olympiad silver- or gold-medal level (the Olympiad is a global mathematics competition) were due to "data leakage": the solutions had been published before and were accessed by the models (Petrov et al., 2025). On the other hand, very rigorous testing by professional mathematicians showed that o4-mini, an OpenAI reasoning model, can solve most of the PhD-level problems they submitted, demonstrating extremely powerful reasoning capacities.
However, there is consensus that current models are not up to the level of mathematical research, although they are getting closer (Chiou, 2025). Two further caveats need to be mentioned. First, not all negative test results are necessarily published, so the models' limitations could be underestimated. Second, it is difficult to estimate the extent of human involvement in the models' work; it may at times be substantial, in which case the models' role would be over-estimated. That said, these models are still at a very early stage of development, and much progress can be expected in the near future.






The AI systems examined above are confined to the cognitive tasks of research: analysing and generating information. Other systems go one step further and aim to perform the whole range of tasks of a research assistant, notably the design of experiments (research assistants), or even the realisation of experiments (robot scientists).




There has been a surge in the supply of AI research assistants since 2024 (see Box 13.4). They can be compared to the teaching assistants described by Baker (2026), even though their functionalities and workings are different. AI research assistants share the following characteristics: 1) they perform all the tasks expected of a research assistant: reviewing the literature, generating hypotheses, designing experiments and drafting articles; 2) they are technically similar to the hypothesis-generation models examined above (multi-agent systems, etc.); 3) they are highly interactive, as their functioning involves frequent and substantive exchanges with a supervising human, who remains in close control of the research process; they are just "assistants", after all. According to So (2025), AI research assistants offer numerous benefits, including accelerated research timelines, 24/7 availability, personalised support, enhanced objectivity and improved accessibility for non-native English speakers. They are evolving to support various collaboration models, from passive assistants to full research partners. Despite their impressive capabilities, AI research assistants face significant challenges, including generating inaccurate information, limitations in critical analysis, and ethical concerns around plagiarism and attribution.
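The human-in-the-loop workflow of such assistants — the assistant drafts each research step, the supervising researcher approves or revises it before the next step runs — can be sketched as follows. The stage names and the lambda "stages" are illustrative assumptions, not any vendor's actual pipeline.

```python
def run_assistant(question, stages, approve):
    """Step through assistant stages, pausing for human sign-off after each.

    `stages` maps a stage name to a function producing a draft from the
    accumulated context; `approve` stands in for the supervising researcher,
    returning either the draft unchanged or a revised version.
    """
    context = {"question": question}
    for name, stage in stages.items():  # dicts preserve insertion order
        draft = stage(context)
        context[name] = approve(name, draft)  # human stays in control
    return context

# Hypothetical four-stage pipeline mirroring the tasks listed above.
stages = {
    "literature_review": lambda c: f"survey of prior work on {c['question']}",
    "hypothesis": lambda c: f"hypothesis derived from {c['literature_review']}",
    "experiment_design": lambda c: f"design testing {c['hypothesis']}",
    "draft_article": lambda c: f"manuscript reporting {c['experiment_design']}",
}

# A permissive "human" that accepts every draft as-is.
result = run_assistant("spaced repetition in maths homework",
                       stages, approve=lambda name, draft: draft)
```

The `approve` callback is the point where the researcher's close control is exercised: replacing the permissive lambda with an interactive prompt would make every stage subject to human revision.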



The “robot scientist” marks the automation of the last step in the research cycle: the performance of experiments. The robot scientist works by connecting laboratory equipment to an AI system: the AI designs the experiments and controls the equipment so that it performs them. Why automate experiments? According to pioneer Ross King (2024), studying eukaryotic systems biology is a complicated task, as even simple eukaryotic cells such as yeast have thousands of genes, proteins and other small molecules that interact in a complex spatial and temporal manner. The high complexity of the models means that their development and evaluation require the execution of millions of experiments based on a hypothesis. Only AI systems with automated labs have the capacity to plan, conduct and monitor such a high number of experiments.

In this case, the robot allows researchers to perform experiments that would be beyond human capabilities. An additional advantage of robot labs is that their experiments generate large quantities of high-quality, controlled data that can be used to train AI models. One example in chemistry is CRESt (Ren et al., 2023): users interact with CRESt in natural language, as with a colleague. CRESt helps to craft and run experiments by retrieving and analysing data, turning equipment on and off, powering robotic arms, documenting findings and alerting scientists when something requires their attention. CRESt-assisted researchers have identified candidate alloys for fuel cells. While important for many scientific fields, the automation of laboratory work seems less relevant for education research, where most experiments involve humans in controlled or real-life environments. It may nevertheless support work in chemistry, biology or neuroscience that helps explain how humans learn and where some learning impediments originate.
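The closed loop at the heart of a robot scientist — plan an experiment, drive the equipment to run it, record the result, repeat — can be reduced to a minimal sketch. The `run_experiment` callable simulates the instrument here; in a real system it would control physical lab equipment, and the trivial "pick the first untested condition" planner would be replaced by a model-driven one.

```python
def robot_scientist(run_experiment, candidates, budget):
    """Closed loop: pick an untested hypothesis, run the experiment,
    record the result, repeat until the experiment budget is spent."""
    results = {}
    for _ in range(budget):
        untested = [c for c in candidates if c not in results]
        if not untested:
            break
        hypothesis = untested[0]                          # planning step (here: trivial)
        results[hypothesis] = run_experiment(hypothesis)  # drives the lab equipment
    # Report the best-supported hypothesis observed so far.
    return max(results, key=results.get), results

# Simulated instrument: measured yield peaks at condition 7 (toy ground truth).
measure = lambda condition: -abs(condition - 7)
best, log = robot_scientist(measure, candidates=range(10), budget=10)
```

Because every loop iteration is logged, the run also produces exactly the kind of controlled, machine-readable experimental record that, as noted above, can feed back into model training.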


Table 13.1 presents a summary of the possible roles of GenAI at the different steps of the research process, as well as their current achievements and limitations.

 




