Publication Type
Journal Article
Version
publishedVersion
Publication Date
6-2025
Abstract
Background: Increasingly, students are using ChatGPT to assist them in learning and even to complete their assessments, raising concerns about academic integrity and the loss of critical thinking skills. Many articles have suggested that educators redesign assessments to be more "Generative-AI-resistant" and focus on assessing higher-order thinking skills. However, few articles attempt to quantify ChatGPT's performance on assessment questions at different cognitive levels, empirical insights that would shape how educators redesign their assessments.
Objectives: Educators need new information on how well ChatGPT performs in order to redesign future assessments for this new paradigm. This paper attempts to fill the gap in empirical research by using spreadsheets modeling assessments, tested under four different prompt engineering settings, to provide new knowledge to support assessments redesign. The proposed methodology can be applied to other course modules so that educators can derive their own insights for future assessment design and actions.
Methods: We evaluated the performance of ChatGPT 3.5 on spreadsheets modeling assessment questions with multiple linked test items categorized according to the revised Bloom's taxonomy. We tested and compared its accuracy under four prompt engineering settings, namely Zero-Shot-Baseline (ZSB), Zero-Shot-Chain-of-Thought (ZSCoT), One-Shot (OS), and One-Shot-Chain-of-Thought (OSCoT), to establish how well ChatGPT 3.5 tackled technical questions at different cognitive learning levels under each setting, and which setting would be effective in enhancing its performance at each level.
Results: We found that ChatGPT 3.5 performed well up to level 3 of the revised Bloom's taxonomy using ZSB, and its accuracy decreased as the cognitive level increased. From level 4 onwards, it did not perform as well, committing many mistakes. ZSCoT achieved modest improvements up to level 5, making it a possible concern for instructors. OS achieved very significant improvements for levels 3 and 4, while OSCoT was needed to achieve very significant improvement for level 5. None of the prompts tested was able to improve the response quality for level 6.
Conclusions: We concluded that educators must be cognizant of ChatGPT's performance on questions at different cognitive levels, and of the enhanced performance achievable with suitable prompts. To develop students' critical thinking abilities, we provide four recommendations for assessment redesign that aim to mitigate ChatGPT's negative impact on student learning and to leverage it to enhance learning, considering its performance at different cognitive levels.
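To make the four prompt engineering settings concrete, the following is a minimal Python sketch of how such prompts can be assembled. The question text, worked example, and chain-of-thought cue are illustrative placeholders only and are not the actual items or prompts used in the paper.

# Illustrative sketch of the four prompt settings (ZSB, ZSCoT, OS, OSCoT).
# All question and example text below is hypothetical, not from the study.

QUESTION = (
    "In a spreadsheet model, cell B2 holds unit price and B3 holds quantity sold. "
    "Write the formula for revenue and explain how you would verify it."
)

# Hypothetical worked example used only by the one-shot variants.
EXAMPLE_Q = "Cell B2 holds cost per unit and B3 holds units produced. Give the total cost formula."
EXAMPLE_A = "=B2*B3, because total cost is cost per unit multiplied by units produced."

COT_CUE = "Let's think step by step."  # common chain-of-thought trigger phrase


def build_prompt(setting: str, question: str) -> str:
    """Assemble the prompt text for one of the four settings."""
    if setting == "ZSB":        # Zero-Shot-Baseline: the question alone
        return question
    if setting == "ZSCoT":      # Zero-Shot-Chain-of-Thought: question plus reasoning cue
        return f"{question}\n{COT_CUE}"
    if setting == "OS":         # One-Shot: a worked example, then the question
        return f"Example question: {EXAMPLE_Q}\nExample answer: {EXAMPLE_A}\n\n{question}"
    if setting == "OSCoT":      # One-Shot-Chain-of-Thought: example, question, and cue
        return (f"Example question: {EXAMPLE_Q}\nExample answer: {EXAMPLE_A}\n\n"
                f"{question}\n{COT_CUE}")
    raise ValueError(f"Unknown prompt setting: {setting}")


if __name__ == "__main__":
    for setting in ("ZSB", "ZSCoT", "OS", "OSCoT"):
        print(f"--- {setting} ---")
        print(build_prompt(setting, QUESTION))
        print()

Each assembled prompt would then be sent to the model and the response graded against the test items at its Bloom's taxonomy level; the grading rubric and model interface are outside the scope of this sketch.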
Keywords
ChatGPT, Performance Evaluation, Spreadsheets Modeling, Assessments Redesign, Revised Bloom’s Taxonomy, Prompt Engineering
Discipline
Artificial Intelligence and Robotics | Educational Technology
Research Areas
Learning and Information Systems Education
Publication
Journal of Computer Assisted Learning
Volume
41
Issue
3
First Page
1
Last Page
17
ISSN
0266-4909
Identifier
10.1111/jcal.70035
Publisher
Wiley
Citation
CHEONG, Michelle L. F.
ChatGPT's performance evaluation in spreadsheets modeling to inform assessments redesign. (2025). Journal of Computer Assisted Learning, 41 (3), 1-17.
Available at: https://ink.library.smu.edu.sg/sis_research/10172
Copyright Owner and License
Authors
Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
https://doi.org/10.1111/jcal.70035