Automated Student Answer Scoring Using GloVe-LSTM and Hybrid Similarity Metrics
This study aims to develop and evaluate an automated scoring model for Indonesian student answers that enhances objectivity, accuracy, and adaptability, addressing persistent challenges in manual assessment such as subjectivity, inconsistency, time inefficiency, and teachers' growing grading workload.
The proposed model combines GloVe word embeddings with a Long Short-Term Memory (LSTM) network, supported by evaluation algorithms including ROUGE Score, TF-IDF, and cosine similarity, to form a robust hybrid scoring system.
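To make the hybrid idea concrete, the sketch below blends a bag-of-words cosine similarity with a simple ROUGE-1 recall overlap between a reference answer and a student answer. This is an illustrative assumption, not the paper's actual formula: the study uses TF-IDF weighting and an unspecified combination scheme, whereas here plain term frequencies and an arbitrary weight `alpha` stand in for both.

```python
from collections import Counter
import math

def rouge1_recall(reference: str, candidate: str) -> float:
    """Fraction of reference unigrams covered by the candidate (ROUGE-1 recall)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(c, cand_counts[t]) for t, c in ref_counts.items())
    return overlap / total

def cosine_sim(a: str, b: str) -> float:
    """Cosine similarity over raw term-frequency vectors (TF-IDF omitted for brevity)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_similarity(reference: str, candidate: str, alpha: float = 0.5) -> float:
    """Weighted blend of the two metrics; alpha is an illustrative hyperparameter."""
    return (alpha * cosine_sim(reference, candidate)
            + (1 - alpha) * rouge1_recall(reference, candidate))
```

A perfect match scores 1.0 on both components, while an answer sharing no vocabulary with the reference scores 0.0, which mirrors the role these metrics play as scoring signals alongside the LSTM.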
The methodology involves designing the model architecture, assembling a proprietary dataset of 3,420 student answers (processed into 3,152 samples) from online sources, practice books, and public repositories, applying standard NLP preprocessing techniques, and training the model using TensorFlow and Keras. A comparative baseline using a manually implemented LSTM with NumPy was also explored.
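In the spirit of the manually implemented NumPy baseline mentioned above, the following is a minimal sketch of a single LSTM cell step. Weight shapes, the gate ordering, and the toy dimensions are illustrative assumptions, not the study's actual configuration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. Shapes: W (4H, D), U (4H, H), b (4H,)."""
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b        # pre-activations for all four gates
    i = sigmoid(z[0:H])                 # input gate
    f = sigmoid(z[H:2*H])               # forget gate
    o = sigmoid(z[2*H:3*H])             # output gate
    g = np.tanh(z[3*H:4*H])             # candidate cell state
    c_t = f * c_prev + i * g            # update cell memory
    h_t = o * np.tanh(c_t)              # emit hidden state
    return h_t, c_t

# Toy dimensions: embedding size 3 (e.g. a GloVe vector), hidden size 4
rng = np.random.default_rng(0)
D, H = 3, 4
W = rng.normal(size=(4 * H, D))
U = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
x = rng.normal(size=D)                  # one token's embedding
h, c = lstm_step(x, h, c, W, U, b)
```

In practice the study trains with TensorFlow and Keras, where this cell logic is provided by the framework; the manual version is useful mainly for the comparative baseline and for understanding the gating mechanics.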
This research contributes a tailored hybrid model for automated scoring in the Indonesian language, providing a foundational analysis that highlights both the model’s potential and its key limitations, thereby informing future system improvements.
The model achieved a mean absolute error (MAE) of 0.0761 and a Pearson correlation of 0.8429, indicating strong alignment with manual grading in terms of relative ranking. However, it tends to overestimate scores for low-quality or irrelevant responses, and it struggles with synonym use, variation in answer length, and minor linguistic errors.
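For reference, the two reported metrics can be computed as below; the score lists in the usage example are illustrative, not the study's data.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error between predicted and manual scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def pearson(y_true, y_pred):
    """Pearson correlation: covariance normalized by both standard deviations."""
    n = len(y_true)
    mt = sum(y_true) / n
    mp = sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    st = math.sqrt(sum((t - mt) ** 2 for t in y_true))
    sp = math.sqrt(sum((p - mp) ** 2 for p in y_pred))
    return cov / (st * sp)

# Illustrative usage with made-up scores on a 0-1 scale
manual = [0.8, 0.5, 1.0, 0.2]
predicted = [0.75, 0.55, 0.95, 0.35]
error = mae(manual, predicted)
corr = pearson(manual, predicted)
```

A high Pearson correlation with a low MAE, as reported, means the model both ranks answers similarly to human graders and keeps absolute score deviations small.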
As a proof-of-concept, the model shows promise as a supportive grading tool that can help reduce teachers’ correction workload and provide fairer, faster assessments in digital learning environments.
Future research should prioritize expanding the dataset’s size and diversity, enhancing architectural components, and integrating more advanced linguistic features. Investigating contextual embeddings, such as BERT, may also address current semantic limitations.
A reliable automated scoring system could significantly reduce teachers’ grading workload, enabling them to dedicate more time to qualitative learning activities and fostering fairer, more efficient assessments.
Further efforts should focus on enhancing the model’s precision, particularly in identifying and penalizing low-quality answers, through improved hybrid architecture design, rigorous hyperparameter optimization, and the exploration of more sophisticated embedding techniques.