Machine Learning Literacy for Measurement Professionals: A Practical Tutorial
The COVID‐19 pandemic has accelerated the digitalization of assessment, creating new challenges for measurement professionals, including big data management, test security, and analyzing new validity evidence. In response to these challenges, Machine Learning (ML) is emerging as an increasingly important skill in the toolbox of measurement professionals in this new era. However, most ML tutorials focus on technical and conceptual details. Therefore, this tutorial aims to provide a practical introduction to ML in the context of educational measurement. We also supplement our tutorial with several examples of supervised and unsupervised ML techniques applied to marking a short‐answer question. Python code is available on GitHub. Finally, common misconceptions about ML are discussed.
Keywords: automated marking; data science; educational measurement; machine learning; tutorial
In the 21st century, educational assessments have become increasingly digitalized, a phenomenon accelerated more recently by the COVID‐19 pandemic. Many assessment organizations were forced to move their examinations to remote online settings. This massive paradigm shift created many challenges and opportunities. For example, validity evidence needs to be re‐evaluated using new forms of data (e.g., big process data, videos), test security concerns have become more pronounced due to remote testing, and previous human resources may not suffice to accommodate the greater demands in examination content and operations.
With a more significant workload and (potentially) fewer resources, measurement professionals may increasingly look to Machine Learning (ML) to automate or augment parts of their work.
However, most ML courses are designed for computer scientists. As a result, they tend to focus on mathematical derivations and algorithm implementations. Yet, in practice, most ML studies are conducted using well‐established ML packages such as scikit‐learn (Pedregosa et al., [28]); technical implementations of ML algorithms are rarely needed. Accordingly, the authors believe measurement specialists can first focus on understanding the purposes and applications of ML concepts rather than technical details. Using an analogy, most people do not need to know the mechanics of cars to drive, and those who know the mechanics may not be the best drivers. Therefore, this tutorial aims to introduce ML from a practical perspective: explaining ML using traditional statistical concepts that measurement specialists are familiar with, highlighting the workflow of a ML study without going into technical details, and demonstrating ML using applied examples in educational measurement.
The tutorial is divided into three sections. The first section introduces basic ML concepts, processes, and recommended programming languages and software. The second section provides demonstrations of supervised and unsupervised ML using an automated marking example. The Python scripts used for this demonstration are provided on GitHub: https://github.com/rui-nie/automark. Finally, the third section discusses common misconceptions about ML and concludes the tutorial. Definitions of all the ML terms covered in this paper can be found in Appendix A.
Introduction to Machine Learning
Basic Concepts
A popular definition of ML is "
In contrast, while ML also includes model specification and parameter estimation (i.e., training) steps, most ML studies do not involve deriving parameter probability distributions. By removing the most mathematically demanding step, ML is free to specify more complex models with better predictive power, at the cost of ignoring parameter significance. Another important distinction is that model specification is theory‐driven in traditional statistics, whereas it is often data‐driven in ML. As a result, ML researchers need to experiment with different model architectures to identify the best model for a problem. Finally, many ML parameter estimation techniques are designed for big and sparse data. One way to characterize
Several similar concepts are related to ML, including
There are two major types of ML[1]: supervised learning and unsupervised learning.
Three popular classes of supervised learning algorithms are bagging algorithms (e.g., random forest; Breiman, [8]), boosting algorithms (e.g.,
While supervised learning algorithms have the capacity to model highly nonlinear relationships, they may also capture irrelevant noise in the sample data. This phenomenon is called overfitting.
Unsupervised ML focuses on modeling data without labels by creating latent variables (e.g., clusters, factors, components; Wong, [39]). In statistics, latent variables are unobserved variables inferred from other observed variables. In the context of ML, they help group features or records. Basic techniques include principal component analysis for grouping features and cluster analysis for grouping records. Again, these techniques are commonly used in educational measurement research. While there are numerous unsupervised algorithms in ML, most can be classified as either dimension reduction or cluster analysis techniques.
Machine Learning Workflow
A ML study consists of several major steps, forming a workflow. We present a general ML workflow in Figure 1 and describe each component in the following sections.
Project goals and data design
A ML study begins with clear definitions of project goals and a proper data design and data collection strategy. A ML study will fall short without good‐quality data. Therefore, careful planning of the data model and the data collection strategy can prevent many issues in later data analysis (Hao & Mislevy, [23]). For example, to develop a ML solution for automated scoring, it may be helpful to follow a framework such as the evidence‐centered design (ECD) framework (Mislevy et al., [27]).
Exploratory data analysis
An important task in a ML study is to explore the data set to better understand it. This process is called exploratory data analysis (EDA).
Data preprocessing and feature extraction
Raw data often need to be transformed before being used in a ML model. This step is called data preprocessing, and the related step of deriving input variables from the preprocessed data is called feature extraction.
Text data are naturally unstructured and require special preprocessing. First, punctuation and commonly used words (i.e., stop words) are typically removed, and words may be reduced to their base forms through stemming or lemmatization.
Specific feature extraction techniques are also applied to text data. That can be done by converting text data into a feature vector of word frequencies (i.e., the bag‐of‐words representation). The counted units can be single words (unigrams) or short sequences of words or characters (n‐grams), and the counts can be weighted, for example by term frequency–inverse document frequency (TF‐IDF).
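To make this concrete, the following is a minimal sketch (not code from the tutorial's GitHub scripts) of converting short answers into word-count and character n-gram TF-IDF features with scikit-learn; the example responses are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Invented short answers for illustration only.
responses = ["psychotherapy", "refer to psychologist", "start antidepressant"]

# Bag-of-words: each column counts how often a word (unigram) appears.
word_counts = CountVectorizer(analyzer="word", ngram_range=(1, 1))
X_words = word_counts.fit_transform(responses)

# Character bigrams/trigrams weighted by TF-IDF: more robust to misspellings.
char_tfidf = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
X_chars = char_tfidf.fit_transform(responses)

print(word_counts.get_feature_names_out())
print(X_words.toarray())
```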
Model selection and evaluation
The previous section introduced various data preprocessing steps to transform raw data into a list of feature variables. These are then input into a ML model to predict the label (supervised learning) or create latent features (unsupervised learning). Since there are many good‐performing ML models, it is common to train and compare several models to identify the best one for the problem. Therefore, it is important to establish a strategy to specify ML models, evaluate them, and select the best candidate.
The formal strategy starts with building a ML pipeline: a chain of data preprocessing, feature extraction, and modeling steps that is trained and evaluated as a single unit.
ML researchers often use a data‐driven approach, such as grid search or random search (Bergstra & Bengio, [6]), to select among candidate pipelines and tune their hyperparameters.
Different ML pipelines are then evaluated and compared using various metrics. The most intuitive evaluation tool for classification problems is a confusion matrix, which cross‐tabulates predicted labels against true labels. Summary metrics such as accuracy, Cohen's kappa (Cohen, [13]), and the area under the receiver operating characteristic (ROC) curve (AUC; Fawcett, [18]) can then be computed to compare pipelines.
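For illustration, the snippet below sketches how a confusion matrix and several common summary metrics could be computed with scikit-learn; the labels and predicted probabilities are invented.

```python
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             confusion_matrix, roc_auc_score)

# Invented true labels and predictions for a three-class marking problem.
y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 0]
# Invented predicted class probabilities (each row sums to 1) for the AUC.
y_prob = [[0.8, 0.1, 0.1], [0.2, 0.6, 0.2], [0.1, 0.8, 0.1],
          [0.1, 0.2, 0.7], [0.2, 0.1, 0.7], [0.5, 0.2, 0.3]]

print(confusion_matrix(y_true, y_pred))                   # true vs. predicted
print(accuracy_score(y_true, y_pred))                     # proportion correct
print(cohen_kappa_score(y_true, y_pred))                  # chance-corrected agreement
print(roc_auc_score(y_true, y_prob, multi_class="ovr"))   # multi-class AUC
```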
Finally, overfitting the training data is a critical consideration when evaluating ML models (Hao, [22]). To select the most generalizable model, ML researchers often randomly divide a data set into training, validation, and test sets. First, the training set is used to train the ML models/pipelines and fit model parameters. Next, the trained models/pipelines are evaluated using the validation set. Based on validation set performance, the best model/pipeline (with the best set of hyperparameters) is selected, and its performance on the test set is reported as the final evaluation of the model. When the training‐validation split occurs only once, and only one validation set is created, it is called a hold‐out validation; when the split is repeated k times so that every training record serves in a validation set once, it is called k‐fold cross‐validation.
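The sketch below illustrates a stratified hold-out split and a five-fold cross-validation call in scikit-learn, using a small synthetic data set in place of real features and labels.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data standing in for extracted features and labels.
X, y = make_classification(n_samples=300, n_features=20, n_classes=3,
                           n_informative=5, random_state=42)
model = LogisticRegression(max_iter=1000)

# Hold out 30% of the records as a test set, stratified by the label.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Hold-out validation: split the training data once more into train/validation.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.3, stratify=y_train, random_state=42)

# Alternatively, five-fold cross-validation on the training set.
scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
print(scores.mean())
```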
After the best ML model is identified, it can be examined further (e.g., using feature importance scores) and applied to new data.
Programming Language and Tools
Python is one of the most popular ML and general programming languages, with numerous packages and large, active support communities. Since Python is, first and foremost, a general programming language, it is well suited to implementing ML solutions in industry. The R programming language also offers several ML packages. Compared with Python, R was initially designed for statistical programming and does not necessarily offer all the flexibility and functionality of a general programming language. However, it has many statistics and psychometrics packages, which makes it popular among educational measurement professionals and an excellent tool for academic research and education.
Programmers often use integrated development environments (IDEs) or interactive notebooks (e.g., Jupyter) to write, test, and document their ML code.
For educational researchers who prefer a Graphical User Interface (GUI) over coding, WEKA is a great and free tool for conducting ML analysis (Frank et al., [20]). Compared with Python, WEKA does not require users to have a programming background; therefore, it is great for teaching and conducting simple ML analysis. In contrast, Python is more flexible and efficient for complex projects.
ML Communities and Resources
There are various online ML communities and resources. For overall guidance, GitHub offers a roadmap[2] for beginners to navigate various online ML resources and communities. For ML courses, Udemy offers practical courses, Coursera offers academic courses, and YouTube offers various free courses. For ML tutorials, Towards Data Science and Medium[3] are great websites. For ML‐related questions, there are many online communities to help, such as Reddit[4] and Stack Overflow.[5] For ML data, Kaggle offers data sets from diverse disciplines and allows ML researchers to compete for the best‐performing model. For sharing NLP models, Hugging Face (Wolf et al., [38]) is a great resource for state‐of‐the‐art NLP models.
Experimental Study
This section demonstrates the application of some popular supervised and unsupervised ML methods in developing automated marking solutions for short‐answer questions.
Problem Description
The question selected for these examples comes from a national medical licensing exam. In this exam, the examinees are presented with a clinical scenario with one or more questions and are asked to provide one or more short written responses to each question. Two human markers then evaluate each response by matching it to the corresponding answer key. If a response does not match any of the correct answer keys, the marker matches it to a dedicated "incorrect" answer key. If the two markers disagree, a super marker resolves the discrepancy and makes the final decision. The question scores are calculated automatically based on predetermined rules. This marking process is labor‐intensive and vulnerable to several threats (fatigue, individual bias, drift over time, etc.). Thus, using automated marking to assist with the marking (e.g., replacing one human marker) is an appealing solution to reduce costs and improve reliability. The examples provided in this tutorial were part of a larger project that aimed to develop supervised and unsupervised ML solutions to classify short answers. For the sake of demonstration, we applied the ML solutions to a single short‐answer question, and we deliberately selected a question with relatively low ML accuracy.
The short‐answer question was administered on a national medical exam that assesses a candidate's critical medical knowledge and clinical decision‐making ability at a level expected of a medical student who is completing their medical degree (construct model). For each clinical case and question, subject‐matter experts specify a list of appropriate clinical decisions, which are reflected in the list of acceptable and unacceptable answers (i.e., answer keys). The clinical problem, leading question and answer keys selected for this demonstration are displayed below.
A 23‐year‐old woman, gravida 2, para 1, aborta 0, comes to the office for a 22‐week prenatal visit. She is tearful and says that she cannot go through with the pregnancy. For several days, she has felt that she is not a very good mother to her 3‐year‐old daughter. She and her husband argue. She states that he is not supportive and that she "cannot please him anymore." The patient has difficulty sleeping, and her appetite has been poor. She also has difficulty concentrating and feels tired most of the time. She has no thoughts of harming herself or her family. She is physically healthy and moderately active. She had similar symptoms for 2 months after the birth of her first child. How will you manage this patient's care? List up to 2, or type in "None" if no management is indicated.
Answer key 1: Psychotherapy. Synonyms: Cognitive behavior therapy; interpersonal therapy; referral to a psychologist; referral to a psychiatrist; referral to mental health services; supportive therapy; group therapy. Not Acceptable: Biofeedback; counseling.
Answer key 2: Family therapy. Synonyms: Marital therapy; couple's counseling; marital counseling; couple therapy; referral to a social worker; supportive therapy. Not Acceptable: Biofeedback.
Answer key 3 (Incorrect): Any response that does not match answer key 1 or 2 is matched to this dedicated key.
Data Set
The demo data set is composed of one feature variable and one label variable. The feature variable is text and contains all the responses provided by examinees to the selected question. The label variable is a numerical identifier (answerkey_id) that specifies the answer key assigned by the super marker to each text response. The data set includes 750 records. The text responses to the selected question are relatively short, with a median length of 4 words (first quartile: 2; third quartile: 6). The "answerkey_id" variable takes three values, representing the first correct answer, the second correct answer, and incorrect answers, with corresponding proportions of 25%, 7%, and 68%. The data set was randomly divided into training/test sets in a 7/3 ratio, stratified by the "answerkey_id" label.
In this tutorial paper, we conducted all analyses using real data. However, the data shared on GitHub are simulated for confidentiality reasons.
Data Preprocessing
For data preprocessing, we used Python functions to automatically (1) convert letters into lowercase; (2) remove all punctuation and leading, trailing, and extra white spaces; and (3) lemmatize words with the spaCy lemmatizer (Honnibal & Montani, [24]).[6]
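A minimal sketch of these three steps (not the actual GitHub script) is shown below; it assumes the spaCy English model en_core_web_sm has been installed.

```python
import re
import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def preprocess(text: str) -> str:
    text = text.lower()                                # (1) lowercase
    text = re.sub(r"[^\w\s]", " ", text)               # (2) drop punctuation
    text = re.sub(r"\s+", " ", text).strip()           #     collapse extra whitespace
    return " ".join(tok.lemma_ for tok in nlp(text))   # (3) lemmatize with spaCy

print(preprocess("Referred  to a psychologist."))      # e.g., "refer to a psychologist"
```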
Machine Learning Pipelines
In this tutorial, we provide a Python script for each ML pipeline. The following sections are organized using the Python script names (e.g., supervised_simple.py) and describe each pipeline. For all non‐deep learning pipelines, we used the scikit‐learn software library (Pedregosa et al., [28]); for all deep learning pipelines, we used Keras API (Chollet et al., [11]) with TensorFlow (Abadi et al., [1]) as the backend.
Supervised learning: non‐neural network pipelines
With large‐scale assessment programs, it is very common to field test (or pilot test) items before using them on operational exams. That is particularly helpful with short‐answer questions because it allows content developers and psychometricians to review and update answer keys. For ML applications, it also provides the opportunity to collect labeled data (marks or scores) and, therefore, to model the data using supervised learning techniques by predicting the label variable using the extracted features from the raw candidate responses. Below, we provide two examples of supervised learning pipelines.
supervised_simple.py
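The full script is available in the GitHub repository. As a hedged, minimal sketch of what a simple supervised pipeline of this kind can look like (it is not the actual supervised_simple.py code), the snippet below chains a word-count vectorizer with a random forest classifier; the responses and labels are invented.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Invented preprocessed responses and answer-key labels for illustration.
responses = ["psychotherapy", "refer to psychologist", "family therapy",
             "marital counseling", "biofeedback", "start antidepressant"]
labels = [1, 1, 2, 2, 0, 0]   # 1/2 = correct answer keys, 0 = incorrect

pipeline = Pipeline([
    ("features", CountVectorizer(analyzer="word", ngram_range=(1, 1))),
    ("model", RandomForestClassifier(n_estimators=200, random_state=42)),
])
pipeline.fit(responses, labels)
print(pipeline.predict(["referral to a psychiatrist"]))
```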
supervised_search.py
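Similarly, the following is a hedged sketch rather than the actual supervised_search.py script: a random search over a pipeline's hyperparameters, here covering word versus character n-grams, TF-IDF weighting on or off, and two random-forest settings. The study's own search also covered other algorithms, such as CatBoost; the data below are invented.

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Invented preprocessed responses and answer-key labels for illustration.
responses = ["psychotherapy", "refer to psychologist", "family therapy",
             "marital counseling", "biofeedback", "none",
             "cognitive behavior therapy", "couple therapy"] * 10
labels = [1, 1, 2, 2, 0, 0, 1, 2] * 10

pipeline = Pipeline([
    ("features", TfidfVectorizer()),
    ("model", RandomForestClassifier(random_state=42)),
])

# Search space: word unigrams vs. character bi/tri-grams, TF-IDF on/off,
# and two random-forest hyperparameters.
search_space = {
    "features__analyzer": ["word", "char_wb"],
    "features__ngram_range": [(1, 1), (2, 3)],
    "features__use_idf": [True, False],
    "model__n_estimators": randint(100, 500),
    "model__max_depth": [None, 10, 20],
}
search = RandomizedSearchCV(pipeline, search_space, n_iter=10, cv=5,
                            scoring="accuracy", random_state=42)
search.fit(responses, labels)
print(search.best_params_, search.best_score_)
```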
Supervised learning: deep neural network pipelines
Deep neural networks have a special position in ML, as they are currently the best‐performing models in many challenging areas, such as machine translation and image recognition. In fact, in the text classification field, many current state‐of‐the‐art models are deep learning models. One example is XLNet, which we will demonstrate in the third example. The sophistication of deep learning models gives them the capacity to process the original word/character sequence without transforming it into fixed feature vectors (e.g., n‐gram counts) beforehand.
Because of the success of deep learning in text classification, we dedicate three examples to it in this section. The first two examples demonstrate how to construct a basic deep learning model, and the third example demonstrates how to apply a state‐of‐the‐art model to our problem. The first two examples use the same deep learning model; the difference is only in the input features: the first used word unigrams, and the second used character bigrams and trigrams.
Deep learning pipeline with word‐unigram features. This example builds a basic deep learning classifier that takes word unigrams as input.
In this example, we also demonstrate one use of transfer learning by using GloVe word embeddings (Pennington et al., [29]) as the initialization values of our own word embedding step. To be more specific, the GloVe word embeddings we used are pretrained on large text corpora from Wikipedia and Newswire. Using pretrained word embeddings allows us to apply the "knowledge" learned from large text corpora to our problem. As a result, the pipeline can have a better training start than just using random values.
The search space of this pipeline consists of different choices of hyperparameters related to network architecture, regularization and optimization.
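The exact network from the tutorial's script is not reproduced here; the sketch below shows one plausible minimal Keras architecture of this kind, with an embedding layer warm-started from a placeholder array standing in for the pretrained GloVe matrix.

```python
import numpy as np
import tensorflow as tf

vocab_size, embed_dim, max_len, num_classes = 5000, 100, 20, 3

# Placeholder standing in for a matrix of pretrained GloVe vectors (one row
# per vocabulary word); in practice it would be filled from the GloVe file.
glove_matrix = np.random.normal(size=(vocab_size, embed_dim)).astype("float32")

embedding = tf.keras.layers.Embedding(vocab_size, embed_dim, trainable=True)
model = tf.keras.Sequential([
    tf.keras.Input(shape=(max_len,)),
    embedding,
    tf.keras.layers.GlobalAveragePooling1D(),   # average word vectors per response
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.3),               # regularization
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
embedding.set_weights([glove_matrix])           # transfer learning: warm start
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

The layer sizes, dropout rate, whether the embedding layer remains trainable, and the optimizer settings are exactly the kinds of hyperparameters that would populate the search space described above.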
Deep learning pipeline with character n‐gram features. This example uses the same deep learning model as the previous example but takes character bigrams and trigrams as input.
Pretrained XLNet pipeline. Libraries such as Hugging Face transformers allow researchers to import state‐of‐the‐art NLP models pretrained on large corpora.
These imported models come with parameters already pretrained on large data. During application, the researcher can have these parameters fixed or freed. If fixed, the model will be applied to data without any training; if freed, the pretrained parameters will only serve as initialization values, and the model will be trained on the researcher's data to better suit the problem at hand. Either way, this is another example of transfer learning.
In this example, we imported the pretrained XLNet model (Yang et al., [40]) from a Hugging Face (Wolf et al., [38]) library and applied it to our data. XLNet is a deep learning model and is one of the current state‐of‐the‐art models in the text classification field. During application, we chose to free the model parameters to make the trained model better suit our data.
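As a hedged sketch rather than the actual script, loading and fine-tuning a pretrained XLNet classifier through the transformers library with a TensorFlow backend could look roughly like the following. The checkpoint name "xlnet-base-cased", the invented responses, and the training settings are illustrative assumptions, and a recent version of transformers is assumed (when compile is called without a loss, it supplies the model's built-in loss).

```python
import numpy as np
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

# Invented responses and answer-key labels for illustration.
texts = ["psychotherapy", "refer to psychologist", "biofeedback", "none"]
labels = np.array([1, 1, 0, 0])

# Requires the sentencepiece package for the XLNet tokenizer.
tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")
model = TFAutoModelForSequenceClassification.from_pretrained(
    "xlnet-base-cased", num_labels=3)          # three answer-key classes

# Tokenize, then fine-tune ("free") the pretrained parameters on our data.
enc = tokenizer(texts, padding=True, truncation=True, return_tensors="tf")
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5))
model.fit(dict(enc), labels, epochs=1, batch_size=2)

# Predicted answer-key classes for the same responses.
pred = model.predict(dict(enc)).logits.argmax(axis=-1)
print(pred)
```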
Unsupervised learning pipelines
While supervised learning algorithms can be applied to previously marked responses, they cannot be used with newly developed questions administered for the first time. That is because supervised learning requires labeled data to train model parameters. In this situation, unsupervised learning algorithms can be applied to augment human markers. For example, one promising solution for augmenting short‐answer marking is to arrange candidate responses in small clusters. Clustering techniques, such as k‐means, group similar responses together; a human marker can then mark one representative (e.g., centroid) response per cluster, and that mark can be propagated to the remaining responses in the cluster.
In practice, measurement professionals would use previously marked items to determine appropriate feature extraction methods (e.g., character or word n‐grams) and clustering settings, and then apply them to newly developed questions.
unsupervised_search.py
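This pipeline applies the clustering approach described above to the demonstration question. As a hedged, minimal sketch of the general idea (not the actual unsupervised_search.py script; k-means is used purely as an illustration), responses are clustered on character n-gram features, and the response closest to each cluster centroid is flagged for human marking.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances_argmin_min

# Invented preprocessed responses for illustration.
responses = ["psychotherapy", "cognitive behavior therapy", "family therapy",
             "marital counseling", "couple therapy", "biofeedback",
             "refer to psychologist", "start antidepressant"]

# Character bi/tri-gram features, then k-means (k chosen by the analyst).
features = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
X = features.fit_transform(responses)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# The response closest to each centroid is sent to a human marker; that mark
# can then be propagated to the other responses in the same cluster.
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, X)
for k, idx in enumerate(closest):
    print(f"cluster {k}: mark '{responses[idx]}' by hand")
```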
Evaluation Metrics and Feature Importance
We used accuracy as the single evaluation metric for selecting models in the above ML pipelines. However, during testing, we added other evaluation metrics, including Cohen's kappa (Cohen, [13]) and the ROC curve and AUC (Fawcett, [18]).
For all non‐deep learning pipelines presented in this tutorial, we used fivefold cross‐validation, given the short training time. However, we used hold‐out validation for deep learning pipelines as the training phase was much slower. More specifically, the training data were randomly split into training/validation sets in a 7/3 ratio, stratified by the "answerkey_id" label.
Demonstration Results and Discussion
Table 1 presents the evaluation metrics of different models (and the two human markers) on the test set. The ROC curves are shown in Figure 2 in separate panels. The best‐performing pipeline on all four evaluation metrics and the ROC curve (its curve lies above the others) is the CatBoost model with both character bigrams and trigrams without TF‐IDF (i.e., the best pipeline from the random search in supervised_search.py). That is not surprising because CatBoost is a powerful yet beginner‐friendly algorithm that does not require users to do extensive hyperparameter tuning. While more extensive searches might allow other algorithms to achieve better performance, the current result supports the efficiency of the CatBoost algorithm for this problem. One trend in our experiment is that deep learning models tended to perform slightly worse than other models and were also slower to train. That may be because deep learning models are designed for complex problems and are not as efficient for more straightforward tasks. Another trend is that character‐based models tended to outperform word‐based models. That is likely because character‐based models are robust to misspellings (human markers are told not to consider them). Finally, we want to point out that most supervised methods have comparable or even better performance than the human markers. One appealing property of supervised marking solutions is that they are consistent over time. This consistency is often harder to achieve for manual marking, for example, when different groups of markers are hired at each examination session.
Table 1. The Evaluation Results of Different Pipelines and Two "Non‐Super" Human Markers on the Test Set
Note. *The supervised_search.py pipeline (in bold) exhibited the best evaluation metrics. **Based on the evaluation scheme described in the "unsupervised learning pipelines" section, some labels in the test set were treated as given by human markers and were excluded from the evaluation. Consequently, the unsupervised_search pipeline has a smaller test set than the other pipelines.
The unsupervised methods performed worse than supervised learning on all metrics.[7] That is expected because the unsupervised automated marking techniques are designed for new questions without previous marking data. Additionally, the question used for this demonstration has many different responses, making the unsupervised marking technique less effective. Unsupervised methods would probably perform better if a question had fewer unique responses.
To demonstrate the use of feature importance scores for model validation, we present the ten most important features for the model in supervised_simple.py in Table 2. A word‐based model is used for demonstration because character‐based models' features are more difficult to interpret. As can be seen, the top features are all related to different forms of psychotherapies or couple therapies, which are consistent with the answer keys.
Table 2. Top 10 Features Based on Feature Importance Score for the Model in supervised_simple.py
Discussion
This paper aimed to expand the toolkit of measurement specialists by providing a step‐by‐step tutorial on applying ML techniques. We focused on ML applications to the marking of short‐answer questions. However, there are many other potential applications of ML for assessment and measurement problems.
In this tutorial, we show that there are many parallels between ML and assessment science. In addition, implementations of many state‐of‐the‐art ML algorithms are freely available in general programming languages (e.g., Python) and statistical programming languages (e.g., R). Therefore, it is becoming easier for measurement specialists to learn and use ML toolkits without starting from scratch. Still, there are other potential barriers to embracing ML in the assessment and measurement community. This section further debunks some common myths and provides practical recommendations.
Misconception 1: ML Only Works for Big Data
A common misconception is that ML algorithms only work for big data (Cui, [16]). The tutorial shows that ML and statistics share many algorithms (e.g., regressions, principal component analysis, cluster analysis). Many of these algorithms can be adjusted to suit different data sizes. While prediction accuracy tends to increase as data size increases, ML models can still offer very good predictions with smaller data sets. For example, we used a data set of 750 records in the examples provided in this paper. More importantly, the data size requirement is more a function of the problem complexity and data quality than of the ML algorithm. If a problem has a relatively simple underlying pattern, sample sizes in the hundreds or thousands may be sufficient for ML to capture it. In contrast, if the underlying problem is complex (e.g., building a general language model) and the data contain lots of noise, big data (e.g., the entire Wikipedia) may be necessary to capture the underlying pattern. Additionally, unlike traditional statistics, it is acceptable for a ML model to have more parameters than records (due to regularization). Sometimes, over‐parameterized ML models can even lead to better performance on the test set (Belkin et al., [5]). For these reasons, in a systematic review, Balki et al. ([4]) recommended using a post hoc method to determine the data size requirement: using increasingly higher proportions of the data (20%, 30%, 40%, etc.) to train the model and then plotting the error for each data size. This data size versus error graph can be used to determine the data size needed for the desired error rate.
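As a hedged illustration of this post hoc approach (not code from the study), scikit-learn's learning_curve can train a model on increasing fractions of a data set and report the cross-validated accuracy at each size; the synthetic data below stand in for real features and marks.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic data standing in for extracted features and marks.
X, y = make_classification(n_samples=750, n_features=30, n_classes=3,
                           n_informative=6, random_state=42)

# Train on increasing fractions of the data and record cross-validated accuracy.
sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(random_state=42), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5, scoring="accuracy")

for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n} training records: validation accuracy {score:.2f}")
# Plotting error (1 - accuracy) against data size shows how much data is
# needed to reach the desired error rate.
```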
Misconception 2: Only Large Enterprises Can Afford ML
Some practitioners may be concerned that ML solutions are not affordable for smaller enterprises. The impression is that all ML problems are highly complex and require big data, a network of supercomputers, and a devoted data science team. The reality is that ML can be applied to both simple and complex problems. For example, many problems in educational measurement (e.g., short‐answer scoring, topic analysis, and risk prediction) can be solved efficiently with existing ML packages. Furthermore, various well‐developed ML packages and models pretrained on big data are now freely available. Practitioners do not need to "reinvent the wheel" to apply ML. Our demonstration shows how to apply a state‐of‐the‐art NLP model to help with short‐answer marking. Accordingly, we recommend that smaller enterprises use similar approaches to explore ML solutions.
Misconception 3: One ML Algorithm Is Enough for Every Problem
Some ML beginners think every problem can be solved with a powerful ML algorithm. However, the effectiveness of a ML algorithm depends heavily on the data and the underlying problem. No matter how powerful a ML algorithm is, it will fall short if the data do not contain enough information relevant to the problem. For this reason, ECD is essential for the successful application of ML in educational measurement. Additionally, a ML algorithm's effectiveness for a given problem depends on the problem's nature and complexity. Deep learning models outshine other models when the problem is complex and the data are big. When the problem is relatively simple or the data are not large enough, many simple models already perform well, and the extra complexity of deep learning does not add much (sometimes it even hurts because deep models are harder to train). However, when problem complexity and data size increase, many traditional models cannot keep up, whereas deep learning models scale very well due to the flexibility and scalability of their architecture. Therefore, we recommend that ML beginners start with random forest and boosting algorithms for structured supervised problems and use deep learning for complex and unstructured NLP, video, and audio processing tasks. It is important to note that this recommendation is only a heuristic for beginners to get started with ML. No theory can predict exactly which algorithm will perform best for a given problem.
Misconception 4: ML Solutions Are Developed to Replace Human Experts Completely
In the field of educational measurement, ML rarely completely replaces human experts. Instead, most ML solutions augment human experts by handling tedious but less complicated tasks. The cost‐effectiveness and reliability of ML in such tasks allow human experts to focus on tasks requiring higher levels of judgment. However, in practice, fully automated ML solutions are often risky and controversial. That is partly because ML still cannot reach human performance on certain tasks (e.g., language comprehension) and partly because, when mistakes are made, society tends to be more critical of computers' mistakes than of humans' mistakes. For this reason, we recommend using ML to augment human experts rather than replace them.
Misconception 5: ML Models Are Black Boxes
Many researchers feel uncomfortable about ML because of its "atheoretical" nature: theories are not always needed to specify the model, and it can be hard to understand what knowledge is extracted from the data. While this may have been true in early ML research, modern ML has a variety of techniques that shed light on how ML models arrive at their final decisions. For example, meaningful features (e.g., number of grammatical errors) can be extracted from raw data using pretrained models, and feature importance can be calculated to help researchers understand which features contribute the most to the final decision. In addition, many ML models are based on the decision tree algorithm, which makes the ML decision‐making process explicit. Even complex deep neural networks have algorithms that help trace which parts of the input contribute the most to the final decision. In some situations, it may also be acceptable to prefer a less accurate but more interpretable model. In the examples provided in this tutorial, we compared models built on word n‐grams with models built on character n‐grams; the character‐based models were more accurate, but the word‐based models produced features that are easier to interpret and relate to the answer keys.
The implication for measurement professionals is that ML models' validity can be studied with careful planning. Consequently, we recommend analyzing the final model's feature importance whenever possible.
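As a hedged sketch of this recommendation (not code from the study), feature importance scores can be pulled out of a fitted scikit-learn pipeline by pairing the vectorizer's feature names with the classifier's importances; the responses and labels below are invented.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Invented responses and answer-key labels for illustration.
responses = ["psychotherapy", "refer to psychologist", "family therapy",
             "marital counseling", "biofeedback", "start antidepressant"]
labels = [1, 1, 2, 2, 0, 0]

pipe = Pipeline([("features", CountVectorizer()),
                 ("model", RandomForestClassifier(n_estimators=200,
                                                  random_state=42))])
pipe.fit(responses, labels)

# Pair each word feature with its importance and list the most influential ones.
names = pipe.named_steps["features"].get_feature_names_out()
importances = pipe.named_steps["model"].feature_importances_
top = np.argsort(importances)[::-1][:10]
for i in top:
    print(f"{names[i]}: {importances[i]:.3f}")
```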
Limitations and Significance
Due to space limitations, this tutorial does not cover every branch of ML (e.g., reinforcement learning; see Footnote 1) or the mathematical details of the algorithms discussed.
Appendix A
Appendix A contains glossaries of the machine learning terms covered in the main document. The aim is to help readers quickly look up a machine learning term in Tables A1–A4.
Table A1. General ML Glossary
Table A2. Feature Extraction Glossary
Table A3. ML Model Glossary
Table A4. ML Evaluation Glossary
Footnotes
1 Reinforcement learning is another major class of ML that is not covered in this tutorial.
2 https://github.com/louisfb01/start-machine-learning
3 https://medium.com/tag/data-science
4 https://www.reddit.com/r/learnmachinelearning/
5 https://stackoverflow.com/questions/tagged/machine-learning
6 We acknowledge that preprocessing/altering raw candidate responses may raise ethical concerns. Consequently, we only use automated marking to augment human markers rather than replace human markers. Ultimately, it is the human markers who make the final decision.
7 Note that the test sets for supervised and unsupervised methods are slightly different due to the centroid responses in the unsupervised test set being removed. Nonetheless, the overall results still stand.
References
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., ... Zheng, X. (2016, May 31). Tensorflow: A system for large‐scale machine learning. Retrieved from https://arxiv.org/abs/1605.08695
Almusharraf, N., & Alotaibi, H. (2020). Gender‐based EFL writing error analysis using human and computer‐aided approaches. Educational Measurement: Issues and Practice, 40 (2), 60 – 71. https://doi.org/10.1111/emip.12413
Anderson, D., Rowley, B., Stegenga, S., Irvin, S., & Rosenberg, J. M. (2020). Evaluating content‐related validity evidence using a test‐based machine learning procedure. Educational Measurement: Issues and Practice, 39 (4), 53 – 64. https://doi.org/10.1111/emip.12314
Balki, I., Amirabadi, A., Levman, J., Martel, A. L., Emersic, Z., Meden, B., Garcia‐Pedrero, A., Ramirez, S. C., Kong, D., Moody, A. R., & Tyrrell, P. N. (2019). Sample‐size determination methodologies for machine learning in medical imaging research: A systematic review. Canadian Association of Radiologists Journal, 70 (4), 344 – 353. https://doi.org/10.1016/j.carj.2019.06.002
Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). Reconciling modern machine‐learning practice and the classical bias‐variance trade‐off. Proceedings of the National Academy of Sciences, 116 (32), 15849 – 15854. https://doi.org/10.1073/pnas.1903070116
Bergstra, J. & Bengio, Y. (2012). Random search for hyper‐parameter optimization. Journal of Machine Learning Research, 13, 281 – 305.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123 – 140. https://doi.org/10.1007/BF00058655
Breiman, L. (2001). Random forests. Machine Learning, 45, 5 – 32. https://doi.org/10.1023/A:1010933404324
Burkhardt, A., Lottridge, S., & Woolf, S. (2020). A rubric for the detection of students in crisis. Educational Measurement: Issues and Practice, 40 (2), 72 – 80. https://doi.org/10.1111/emip.12410
Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785 – 794). New York, NY, USA : ACM. https://doi.org/10.1145/2939672.2939785
Chollet, F. (2015). Keras. GitHub. Retrieved from https://github.com/fchollet/keras
Clark, K., & Manning, C. D. (2016). Improving coreference resolution by learning entity‐level distributed representations. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1 : Long Papers). https://doi.org/10.18653/v1/p16‐1061
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20 (1), 37 – 46.
Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213 – 220.
Cortes, C., & Vapnik, V. (1995). Support‐vector networks. Machine Learning, 20 (3), 273 – 297.
Cui, Z. (2021). Machine learning and small data. Educational Measurement: Issues and Practice, 40 (4), 8 – 12. https://doi.org/10.1111/emip.12472
Ercikan, K., & McCaffrey, D. F. (2022). Optimizing implementation of artificial‐intelligence‐based automated scoring: An evidence centered design approach for designing assessments for AI‐based scoring. Journal of Educational Measurement, 59 (3), 272 – 287. https://doi.org/10.1111/jedm.12332
Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27 (8), 861 – 874.
Ferrara, S. (2017). A framework for policies and practices to improve test security programs: Prevention, detection, investigation, and resolution (PDIR). Educational Measurement: Issues and Practice, 36 (3), 5 – 23. https://doi.org/10.1111/emip.12151
Frank, E., Hall, M. A., & Witten, I. H. (2016). The WEKA Workbench. Online Appendix for "Data Mining: Practical Machine Learning Tools and Techniques " (4th edn.). Burlington : Morgan Kaufmann.
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29, 1189 – 1232.
Hao, J. (2021). Supervised Machine Learning. In A. A. von Davier, R. J. Mislevy, & J. Hao (Eds.), Computational psychometrics: New methodologies for a new generation of digital learning and assessment with examples in R and Python (pp. 159 – 171). Springer. https://doi.org/10.1007/978‐3‐030‐74394‐9
Hao, J., & Mislevy, R. J. (2021). A data science perspective on computational psychometrics. Computational Psychometrics: New Methodologies for a New Generation of Digital Learning and Assessment, 133 – 158. https://doi.org/10.1007/978‐3‐030‐74394‐9_8
Honnibal, M., & Montani, I. (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. Sentometrics Research.
Huang, Y., & Khan, S. M. (2021). Advances in AI and machine learning for education research. In A. A. von Davier, R. J. Mislevy, & J. Hao (Eds.), Computational psychometrics: New methodologies for a new generation of digital learning and assessment with examples in R and Python (pp. 195 – 208). Springer. https://doi.org/10.1007/978‐3‐030‐74394‐9
McNamara, D. S., Louwerse, M. M., McCarthy, P. M., & Graesser, A. C. (2010). Coh‐Metrix: Capturing linguistic features of cohesion. Discourse Processes, 47 (4), 292 – 330.
Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives, 1, 3 – 62. https://doi.org/10.1207/S15366359MEA0101_02
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., & Passos, A. (2011). Scikit‐learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825 – 2830.
Pennington, J., Socher, R., & Manning, C. D. (2014). Glove: Global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Qatar, pp. 1532 – 1543. https://doi.org/10.3115/v1/D14‐1162
Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., & Gulin, A. (2017). CatBoost: unbiased boosting with categorical features (Version 5). arXiv. https://doi.org/10.48550/ARXIV.1706.09516
Rafatbakhsh, E., Ahmadi, A., Moloodi, A., & Mehrpour, S. (2021). Development and validation of an automatic item generation system for English idioms. Educational Measurement: Issues and Practice, 40 (2), 49 – 59.
Russell, S., & Norvig, P. (2010). Artificial intelligence: A modern approach. (3rd edn.). Upper Saddle River : Prentice‐Hall.
San Pedro, M. O. Z., & Baker, R. S. (2021). Knowledge inference models used in adaptive learning. In A. A. von Davier, R. J. Mislevy, & J. Hao (Eds.), Computational psychometrics: new methodologies for a new generation of digital learning and assessment with examples in R and Python (pp. 61 – 77). Springer. https://doi.org/10.1007/978‐3‐030‐74394‐9
Savi, A. O., Cornelisz, I., Sjerps, M. J., Greup, S. L., Bres, C. M., & van Klaveren, C. (2021). Balancing trade‐offs in the detection of primary schools at risk. Educational Measurement: Issues and Practice, 40 (3), 110 – 124. https://doi.org/10.1111/emip.12433
Legg, S., & Hutter, M. (2007, June 15). A collection of definitions of intelligence. Retrieved from https://arxiv.org/abs/0706.3639
Sheehan, K. M. (2017). Validating automated measures of text complexity. Educational Measurement: Issues and Practice, 36 (4), 35 – 43. https://doi.org/10.1111/emip.12155
von Davier, A. A., Mislevy, R. J., & Hao, J. (Eds.). (2021). Computational psychometrics: New methodologies for a new generation of digital learning and assessment with examples in R and Python. Springer. https://doi.org/10.1007/978‐3‐030‐74394‐9
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gugger, S., ... Rush, A. M. (2020, Jul 14). HuggingFace's transformers: State‐of‐the‐art natural language processing. Retrieved from https://arxiv.org/abs/1910.03771
Wong, P. C. (2021). Unsupervised machine learning. In A. A. von Davier, R. J. Mislevy, & J. Hao (Eds.), Computational psychometrics: New methodologies for a new generation of digital learning and assessment with examples in R and Python (pp. 173 – 193). Springer. https://doi.org/10.1007/978‐3‐030‐74394‐9
Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., & Le, Q. V. (2020, Jan 1). XLNet: Generalized autoregressive pretraining for language understanding. Retrieved from https://arxiv.org/abs/1906.08237
Zhou, Z. ‐H. (2012). Ensemble methods: Foundations and algorithms (1st ed.). Chapman and Hall/CRC. https://doi.org/10.1201/b12207
By Rui Nie, Qi Guo, and Maxim Morin