Investigating Reliability and Validity in Rating Scripts for Standardisation Purposes in Onscreen Marking

  • Create Date August 2, 2018
  • Last Updated August 2, 2018

This study investigated the reliability and validity of Assistant Examiners (AEs) in rating the standardised scripts used as benchmarks in onscreen marking (OSM) of the written component of Primary 6 English Language in the Territory-wide System Assessment in the Hong Kong Special Administrative Region. Marking criteria included ‘Content’ and ‘Language’. Standardised scripts were employed for three purposes: 1) training markers, 2) qualifying markers before they started rating and 3) check-marking the markers at random intervals throughout the entire OSM period. These standardised scripts therefore played a vital role in monitoring marking quality, even with the cutting-edge technology of OSM. Scripts were drawn from a stratified sample (N = 250 students) from a total of some 580 participating schools with a student population of 72,000. Having every AE mark all 250 scripts would have been time-consuming and would have induced ‘rater fatigue’, which was likely to affect rater reliability. Therefore, ‘overlapping marking’ was adopted, in which each AE needed to rate fewer than 70 scripts. About 20 of each rater's scripts overlapped with those of another rater, forming an unbroken chain of overlap. These data enabled correlations between expert panel ratings and AEs’ ratings, and the Multi-faceted Rasch Model was run to calculate the ‘fair average’ (FA) for all AEs and the ‘infit’ for each rater. To ‘externally’ validate the ratings, verifiable quantitative measures (VQM) were used as a check and were correlated against both FA and individual ratings. The VQM included ‘number of meaningful clauses’, ‘syntactic complexity’, ‘lexical variation’, ‘families of words’, etc. The results yielded correlations in the range of 0.6 to 0.9 for FA (α < 0.05) and 0.4 to 0.8 for individual raters (α < 0.05), showing that the method used in rating scripts for standardisation purposes was in most cases valid and reliable, especially when FA was used.
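
The ‘overlapping marking’ design described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the number of AEs or how scripts were actually allocated, so the function name `overlapping_allocation`, the window size of 65 scripts and the simple linear chain are all assumptions made for illustration, not the study's procedure.

```python
def overlapping_allocation(n_scripts, per_rater, overlap):
    """Assign consecutive windows of script indices to raters.

    Each window shares `overlap` scripts with the next one, so the
    raters form an unbroken chain of overlap (hypothetical scheme).
    """
    if not 0 < overlap < per_rater:
        raise ValueError("overlap must be positive and smaller than per_rater")
    step = per_rater - overlap  # how far each new window advances
    windows = []
    start = 0
    while True:
        end = min(start + per_rater, n_scripts)  # clip the last window
        windows.append(range(start, end))
        if end == n_scripts:
            break
        start += step
    return windows


# Assumed figures: 250 scripts, 65 per rater (fewer than 70), overlap of 20.
windows = overlapping_allocation(n_scripts=250, per_rater=65, overlap=20)
for rater, scripts in enumerate(windows, start=1):
    print(f"Rater {rater}: scripts {scripts.start}-{scripts.stop - 1}")
```

Under these assumed figures every script is rated at least once and each pair of neighbouring raters shares 20 scripts. It is this linkage that allows a many-facet Rasch analysis to place all raters on a common scale even though no rater marks every script.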
