
Detailed Record

Author (Chinese): 胡柏先
Author (English): Hu, Bo Sien
Title (Chinese): 應用Mantel-Haenszel於BIB設計情境中進行DIF檢測時總分配對策略選擇之比較
Title (English): Comparison of Different Matching Strategies Using the Mantel-Haenszel Method to Detect DIF in the BIB Booklet Design
Advisor (Chinese): 陳承德
Advisor (English): Chen, Cheng Te
Committee members (Chinese): 陳承德、鄒慧英、施慶麟
Committee members (English): Chen, Cheng Te; Tzou, Hue Ying; Shih, Ching Lin
Degree: Master's
University: 國立清華大學 (National Tsing Hua University)
Department: 學習科學研究所 (Institute of Learning Sciences)
Student ID: 102002501
Year of publication: 2016 (ROC year 105)
Graduation academic year: 104 (2015–2016)
Language: Chinese
Number of pages: 97
Keywords (Chinese): Mantel-Haenszel、差異試題功能、BIB設計、總分配對策略、等化組合題本
Keywords (English): Mantel-Haenszel; DIF; BIB design; Matching strategy; Equated pooled booklet
Abstract (Chinese):
In large-scale assessments, booklet designs are typically used to administer a large number of items so that the latent traits of the examinee population can be sampled and described as fully as possible. This, however, reduces the number of examinees responding to each item and weakens the detection of differential item functioning (DIF). As testing has become more widespread, non-parametric item-analysis methods with a lower barrier to entry have grown popular among practitioners. In previous research on applying the non-parametric Mantel-Haenszel method to booklet designs, the total-score matching strategy was identified as the key determinant of DIF detection performance, yet few studies have compared the recently developed matching strategies in more complex booklet designs. Moreover, earlier studies typically applied no scale-purification procedure, or only a restricted one, during DIF detection, and manipulated only small differences in mean difficulty between booklets.
To clarify whether equated and non-equated matching strategies differ in detection performance when the difference in mean booklet difficulty widens, to examine how an iterative scale-purification procedure based on current DIF results affects detection under each matching strategy, and to compare DIF detection performance in the main- and sub-dimension booklet designs of PISA 2012, this study manipulated sample size, the difference in mean latent-trait level between groups (impact), the percentage of DIF items, and the range of mean booklet difficulty. It compared the power and Type I error rate of three total-score matching strategies (block level, percent pooled booklet, and equated pooled booklet) under the PISA 2012 main- and sub-dimension booklet framework, combined with a scale-purification procedure based on current DIF results.
The results show that all three matching strategies achieved higher power in the sub-dimension than in the main dimension. Within the same dimension and with comparable sample sizes, the performance of the block-level strategy declined as the percentage of DIF items rose, and the percent-pooled-booklet strategy was degraded by the three-way interaction of impact, range of mean booklet difficulty, and DIF percentage; the equated-pooled-booklet strategy outperformed both in Type I error control and power across all conditions. The equated-pooled-booklet strategy with scale purification was further applied to the PISA 2012 Taiwan mathematics data in a gender-based DIF analysis, where about 30% of the items were flagged as showing DIF. Accordingly, for DIF detection under a BIB booklet design, the equated-pooled-booklet strategy with scale purification is recommended, as it yields more reliable results when mean booklet difficulty differs substantially across booklets or when the test contains a high percentage of DIF items.
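The Mantel-Haenszel approach named above tests, for one studied item, whether reference- and focal-group examinees matched on total score have equal odds of answering correctly. The following numpy sketch of the standard MH statistics is a generic illustration, not the thesis code; the count arrays (`ref_correct`, etc.) are hypothetical inputs over the K matched strata:

```python
import numpy as np

def mantel_haenszel_dif(ref_correct, ref_wrong, foc_correct, foc_wrong):
    """MH common odds ratio, ETS delta, and chi-square for one item.

    Each argument is a length-K array of counts over K matched
    total-score strata: A_k, B_k for the reference group and
    C_k, D_k for the focal group (correct / incorrect).
    """
    A = np.asarray(ref_correct, float)
    B = np.asarray(ref_wrong, float)
    C = np.asarray(foc_correct, float)
    D = np.asarray(foc_wrong, float)
    T = A + B + C + D                              # stratum totals
    alpha = np.sum(A * D / T) / np.sum(B * C / T)  # common odds ratio
    delta = -2.35 * np.log(alpha)                  # ETS delta scale
    # Chi-square with continuity correction (Holland & Thayer, 1988)
    expected_A = (A + B) * (A + C) / T
    var_A = (A + B) * (C + D) * (A + C) * (B + D) / (T**2 * (T - 1))
    chi2 = (abs(np.sum(A - expected_A)) - 0.5) ** 2 / np.sum(var_A)
    return alpha, delta, chi2
```

An alpha near 1 (delta near 0) indicates no DIF; the chi-square statistic is referred to a chi-square distribution with one degree of freedom.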
Abstract (English):
In large-scale assessment programs, booklet designs are commonly adopted to sample a large number of test items while describing the latent traits of student participants. The missingness arising from a booklet design not only reduces the number of examinees responding to each item but also leaves structural gaps in participants' responses, which in turn harms DIF assessment. Recently, increasing attention has been drawn to a non-parametric DIF assessment method, the Mantel-Haenszel (MH) test, owing to its simplicity. Although past studies have found that the matching strategy is crucial to the effectiveness of MH-DIF assessment in tests using booklet designs, research comparing the effectiveness of recently developed matching strategies within a more authentic but complex booklet-design context is relatively rare. Furthermore, DIF assessments in previous studies were often conducted with a limited scale-purification procedure, or none at all, and the manipulated differences in mean difficulty between booklets were too small to influence the results.
Three research questions were raised. First, can the differences in MH-DIF results between the equated-pooled-booklet matching strategy and the other matching strategies be amplified by increasing the difference in mean difficulty between booklets? Second, if the matching variable is iteratively purified according to presumed DIF results rather than the true DIF items, does this purification procedure affect the performance of DIF assessment under the various matching strategies? Third, how do the DIF assessment results of the various matching strategies differ between the main and sub-dimensions of the PISA 2012 booklet design?
In this study, sample size, impact, percentage of DIF items, and the range of mean item difficulty between booklets were manipulated. The booklet design followed the authentic PISA 2012 design, and the matching variable was iteratively purified based on presumed DIF results. The Type I error rate and power of DIF assessment under three matching strategies, namely block level, percent pooled booklet, and equated pooled booklet, were recorded.
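The equated pooled booklet strategy rests on the idea that raw total scores from booklets of unequal mean difficulty must be placed on a common scale before examinees are pooled and matched. As a simplified stand-in for that equating step, the sketch below uses linear equating under a random-groups design; it is an illustration of the general idea, not the procedure implemented in the thesis appendix:

```python
import numpy as np

def linear_equate(scores_x, scores_y):
    """Return a function mapping form-X scores onto the form-Y scale
    by matching means and standard deviations (linear equating)."""
    x = np.asarray(scores_x, float)
    y = np.asarray(scores_y, float)
    slope = y.std() / x.std()
    intercept = y.mean() - slope * x.mean()
    return lambda s: slope * np.asarray(s, float) + intercept
```

After each booklet's scores are mapped to a common reference scale, examinees from all booklets can be pooled into a single matching variable for the MH test.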
The findings indicated that the power rates in the sub-dimension were higher than those in the main dimension for all three matching strategies. Controlling for dimension and sample size, the power of the block-level matching strategy dropped as the proportion of DIF items in the test increased, and the power of the percent-pooled-booklet strategy was affected by the three-way interaction of impact, range of mean item difficulty between booklets, and percentage of DIF items. Compared with the previous two strategies, the equated-pooled-booklet strategy yielded the best Type I error rate and the highest power in all scenarios. Furthermore, real data from the PISA 2012 mathematics test in Taiwan were analyzed for gender DIF using the equated-pooled-booklet strategy; approximately 30% of the items were flagged as DIF. Based on these findings, the equated-pooled-booklet strategy with an iterative purification procedure is strongly recommended for DIF assessment, especially when the mean difficulties of booklets differ substantially or when many DIF items are expected in the test.
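The iterative purification procedure referred to above can be sketched as a fixed-point loop: build the matching score from items not currently flagged, re-test every item, and stop when the flagged set no longer changes. In this generic sketch, `flag_item` is a hypothetical callback (for example, an MH test at a chosen significance level); it is not the thesis implementation:

```python
def iterative_purification(flag_item, n_items, max_iter=10):
    """Iteratively purify the matching scale.

    flag_item(item, anchor_items) should return True when `item`
    shows DIF given a matching score built from `anchor_items`.
    Returns the set of flagged items once it stabilizes (or after
    max_iter passes).
    """
    flagged = set()
    for _ in range(max_iter):
        anchor = [i for i in range(n_items) if i not in flagged]
        new_flagged = {i for i in range(n_items) if flag_item(i, anchor)}
        if new_flagged == flagged:  # fixed point reached
            break
        flagged = new_flagged
    return flagged
```

Because the flags are recomputed from presumed rather than true DIF items, the loop mirrors the operational setting studied in the second research question.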
Table of Contents

Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Tables
List of Figures
Chapter 1: Introduction
Section 1: Background and Forms of Large-Scale Assessment Programs
Section 2: Large-Scale Assessment Programs and Test Fairness
Section 3: Differential Item Functioning and Detection Methods
Section 4: Research Motivation and Purpose
Section 5: Research Questions
Chapter 2: Literature Review
Section 1: Large-Scale Assessments and Booklet Design
Section 2: Differential Item Functioning
Section 3: DIF Detection in PISA 2012
Section 4: The Mantel-Haenszel Method and the Exact Test
Section 5: Scale Purification
Section 6: Applications of MH in Booklet Designs and Related Research Developments
Block Level
Booklet Level
Pooled Booklet Level
Percent Pooled Booklet Matching Strategy
Equated Pooled Booklet Matching Strategy
Test Equating
Section 7: Research Hypotheses
Chapter 3: Research Methods
Section 1: Data Generation
Section 2: Research Design
Section 3: Data Analysis
Chapter 4: Results
Section 1: Type I Error in the Main Dimension
Section 2: Power in the Main Dimension
Section 3: Type I Error in the Sub-dimension
Section 4: Power in the Sub-dimension
Section 5: Real Data Analysis
Chapter 5: Discussion and Conclusions
Section 1: Review of the Research Questions and Summary of Results
DIF Detection Trends of the Three Total-Score Matching Strategies across Experimental Conditions
Comparison of Detection Results between the Main and Sub-dimensions for the Three Matching Strategies
Section 2: Connections and Comparisons with Related Research
Section 3: Conclusions and Recommendations
Section 4: Limitations and Future Directions
References
Appendices
Appendix A: PISA 2012 Mathematics and Science Item Parameters
Appendix B: Code Implementing the Equated Pooled Booklet Matching Strategy
Appendix C: Type I Error Rates of the Three Matching Strategies in the Main Dimension across Conditions (Impact = 0.0, 0.5, 1.0)
Appendix D: Power of the Three Matching Strategies in the Main Dimension across Conditions (Impact = 0.0, 0.5, 1.0)
Appendix E: Type I Error Rates of the Three Matching Strategies in the Sub-dimension across Conditions (Impact = 0.0, 0.5, 1.0)
Appendix F: Power of the Three Matching Strategies in the Sub-dimension across Conditions (Impact = 0.0, 0.5, 1.0)
References

王文中 (2004). Rasch measurement theory and its applications in education and psychology [in Chinese]. 教育與心理研究, 27(4), 637–694.
王文中、陳承德 (Trans.). (2008). Psychological testing (Original authors: K. R. Murphy & C. O. Davidshofer). Taipei: 雙葉書廊. (Original work published 2001)
余民寧 (2009). Item response theory (IRT) and its applications [in Chinese]. Taipei: 心理.
余民寧、謝進昌 (2006). An empirical analysis of DIF in the Basic Competence Test for Junior High School Students: The two administrations of 2002 as examples [in Chinese]. 教育學刊, 26, 241–276.
郭伯臣 (2010). Test equating [in Chinese]. In 譚克平 et al. (Eds.), Collected papers on testing and assessment: Item bank construction and test development (1st ed., pp. 102–134). Taipei County: National Academy for Educational Research, Preparatory Office.
臺灣PISA國家研究中心 (2010, July). Project overview [in Chinese]. Retrieved January 28, 2016, from http://pisa.nutn.edu.tw/pisa_tw.htm
Agresti, A. (1996). An introduction to categorical data analysis. New York: Wiley.
Albano, A. D. (2014). equate: An R package for observed-score linking and equating. R package version 2.
Allen, N. L., & Donoghue, J. R. (1996). Applying the Mantel-Haenszel Procedure to Complex Samples of Items. Journal of Educational Measurement, 33(2), 231–251.
Bradley, D. R., Bradley, T. D., McGrath, S. G., & Cutcomb, S. D. (1979). Type I error rate of the chi-square test of independence in R × C tables that have small expected frequencies. Psychological Bulletin, 86(6), 1290–1297.
Chen, J.-H., Chen, C.-T., & Shih, C.-L. (2014). Improving the control of Type I error rate in assessing differential item functioning for hierarchical generalized linear model when impact is presented. Applied Psychological Measurement, 38(1), 18–36.
Cheng, Y., Chen, P., Qian, J., & Chang, H.-H. (2013). Equated pooled booklet method in DIF testing. Applied Psychological Measurement, 37(4), 276–288.
DeMars, C. E. (2010). Type I error inflation for detecting DIF in the presence of impact. Educational and Psychological Measurement, 70(6), 961–972.
Donoghue, J. R., Holland, P. W., & Thayer, D. T. (1993). A Monte Carlo study of factors that affect the Mantel-Haenszel and standardization measures of differential item functioning. Differential Item Functioning, 137–166.
Dorans, N. J. (1990). Equating Methods and Sampling Designs. Applied Measurement in Education, 3(1), 3-17.
Dorans, N. J., & Holland, P. W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P. W. Holland and H. Wainer (Eds.), Differential item functioning (pp.35-66). Hillsdale, NJ: Lawrence Erlbaum Associates.
Dorans, N. J., & Holland, P. W. (2000). Population Invariance and the Equatability of Tests: Basic Theory and the Linear Case. Journal of Educational Measurement, 281-306.
Dorans, N. J., Liu, J., & Hammond, S. (2008). Anchor test type and population invariance: An exploration across subpopulations and test administrations. Applied Psychological Measurement, 32(1), 81–97.
Fidalgo, A. M., Mellenbergh, G. J., & Muñiz, J. (2000). Effects of amount of DIF, test length, and purification type on robustness and power of Mantel-Haenszel procedures. Methods of Psychological Research Online, 5(3), 43–53.
Fidalgo, A. M., & Madeira, J. M. (2008). Generalized Mantel-Haenszel methods for differential item functioning detection. Educational and Psychological Measurement, 68(6), 940–958.
Finch, H. (2005). The MIMIC model as a method for detecting DIF: Comparison with Mantel-Haenszel, SIBTEST, and the IRT likelihood ratio. Applied Psychological Measurement, 29(4), 278–295.
Frey, A., Hartig, J., & Rupp, A. A. (2009). An NCME Instructional Module on Booklet Designs in Large-Scale Assessments of Student Achievement: Theory and Practice. Educational Measurement: Issues and Practice, 28(3), 39–53.
Goodman, J. T., Willse, J. T., Allen, N. L., & Klaric, J. S. (2011). Identification of differential item functioning in assessment booklet designs with structurally missing data. Educational and Psychological Measurement, 71(1), 80-94.
Hidalgo, M. D. (2004). Differential Item Functioning Detection and Effect Size: A Comparison between Logistic Regression and Mantel-Haenszel Procedures. Educational and Psychological Measurement, 64(6), 903–915.
Hu, B. S., & Chen, C. T. (2015, March). Applying Double Purification Procedure for Differential Item Functioning on Large Scale Assessments. Paper session presented at The Fifth Asian Conference on Psychology & Behavioral Sciences, Osaka, Japan.
Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer and H. I. Braun (Eds.), Test validity (pp. 129-145). Hillsdale, NJ: Lawrence Erlbaum Associates.
Kolen, M. J., & Brennan, R. L. (1987). Linear equating models for the common-item nonequivalent-populations design. Applied Psychological Measurement, 11(3), 263–277.
Kopf, J., Zeileis, A., & Strobl, C. (2015). Anchor Selection Strategies for DIF Analysis: Review, Assessment, and New Approaches. Educational and Psychological Measurement, 75(1), 22–56.
Le, L. T. (2009). Investigating gender differential item functioning across countries and test languages for PISA science items. International Journal of Testing, 9(2), 122–133.
Lee, H., & Geisinger, K. F. (2015). The matching criterion purification for differential item functioning analyses in a large-scale assessment. Educational and Psychological Measurement, 76(1), 141–163.
Little, R. J., & Rubin, D. B. (1989). The analysis of social science data with missing values. Sociological Methods & Research, 18(2-3), 292–326.
Li, Z. (2015). A Power Formula for the Mantel-Haenszel Test for Differential Item Functioning. Applied Psychological Measurement, 39(5), 373–388.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.
Magis, D., & De Boeck, P. (2014). Type I Error Inflation in DIF Identification With Mantel-Haenszel: An Explanation and a Solution. Educational and Psychological Measurement, 74(4), 713–728.
Mazor, K. M. (1994). Identification of Nonuniform Differential Item Functioning Using a Variation of the Mantel-Haenszel Procedure. Educational and Psychological Measurement, 54(2), 284-291.
Organisation for Economic Co-operation and Development (OECD). (2014). PISA 2012 Technical Report. Paris: Author.
Parshall, C. G., & Miller, T. R. (1995). Exact Versus Asymptotic Mantel-Haenszel DIF Statistics: A Comparison of Performance Under Small-Sample Conditions. Journal of Educational Measurement, 32(3), 302–316.
Preece, D. A. (1990). Fifty years of Youden squares: a review. Bulletin of the Institute of Mathematics and its Applications, 26(4), 65–75.
Revelle, W. (2015). Using the psych package to generate and test structural models. Retrieved from http://bioconductor.statistik.tu-dortmund.de/cran/web/packages/psych/vignettes/psych_for_sem.pdf
Sandilands, D. A. (2014). Accuracy of differential item functioning detection methods in structurally missing data due to booklet design. (Unpublished doctoral dissertation). The University of British Columbia, Vancouver, Canada.
Shealy, R., & Stout, W. (1993). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF. Psychometrika, 58(2), 159–194.
Shih, C.-L., & Wang, W.-C. (2009). Differential item functioning detection using the multiple indicators, multiple causes method with a pure short anchor. Applied Psychological Measurement, 33(3), 184–199.
Su, Y.-H., & Wang, W.-C. (2005). Efficiency of the Mantel, Generalized Mantel–Haenszel, and logistic discriminant function analysis methods in detecting differential item functioning for polytomous items. Applied Measurement in Education, 18(4), 313–350.
Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361–370.
Thissen, D., Steinberg, L., & Wainer, H. (1988). Use of item response theory in the study of group differences in trace lines. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 147–170). Hillsdale, NJ: Lawrence Erlbaum.
Wald, A. (1943). Tests of Statistical Hypotheses Concerning Several Parameters When the Number of Observations is Large. Transactions of the American Mathematical Society, 54(3), 426-482.
Wang, W.-C., Shih, C.-L., & Sun, G.-W. (2012). The DIF-free-then-DIF strategy for the assessment of differential item functioning. Educational and Psychological Measurement, 72(4), 687–708.
Wang, W.-C., & Su, Y.-H. (2004). Effects of average signed area between two item characteristic curves and test purification procedures on the DIF detection via the Mantel-Haenszel method. Applied Measurement in Education, 17(2), 113–144.
Wang, W.-C., & Yeh, Y.-L. (2003). Effects of anchor item methods on differential item functioning detection with the likelihood ratio test. Applied Psychological Measurement, 27(6), 479–498.
Woods, C. M. (2008). Empirical Selection of Anchors for Tests of Differential Item Functioning. Applied Psychological Measurement, 33(1), 42–57.
Youden, W. J. (1937). Use of incomplete block replications in estimating tobacco-mosaic virus. Contributions from Boyce Thompson Institute, 9, 41–48.
Youden, W. J. (1940). Experimental designs to increase accuracy of greenhouse studies. Contributions from Boyce Thompson Institute, 11, 219–228.