Implications and Challenges to Using Data Mining in Educational Research in the Canadian Context
ElAtia, Samira, Ipperciel, Donald, Hammad, Ahmed, Canadian Journal of Education
According to The Economist (2011), citing an EMC/IDC Go-to-the-Market study, 130 exabytes (1) of information were generated in 2005; this number is forecast to increase to 2,720 exabytes in 2012 and to nearly triple by 2015, when it is predicted to reach 7,910 exabytes (EMC/IDC 2011). Data are generated daily on every aspect of our lives. This data "will flood the planet [...] it will help us understand it better" (Big data, 2011), provided that it is used and analyzed properly.
In institutions of higher education, the same trend toward ever-larger amounts of data holds. For Siemens and Long (2011), "the most dramatic factor shaping the future of higher education is something that we can't actually touch or see: big data and analytics" (p. 1). Large amounts of data from a variety of sources are collected daily on classes, students, administration, faculty members, programs of study, and so on. Most of this data goes unprocessed. The little that is processed is confined to a specific inquiry or targeted research question. None of it is examined from a "big picture" perspective that combines everything that is collected. Data sets are not interlinked; each remains independent of the others. As a result, potentially important and valuable information is lost. Data are neither stored nor treated as a single large entity in which more variables could be included and trends revealed.
The main cause of this situation, and of the resulting loss of valuable information, is the lack of a systematic approach to collecting, storing, codifying, and analyzing this data. In the majority of cases, the data is neither collected nor coded properly in the first place. It is stored in formats that allow little analysis or extraction of useful knowledge. It is analyzed at the level of smaller units, where only interested parties can take advantage of it and, most importantly, where research questions and variables are already predetermined. Moreover, this data is completely unconnected and is stored in ways that do not allow relationships to be built or trends to be discerned.
Yet a wealth of knowledge could be harvested from this scattered data if it were available to a larger research audience: in a format accessible to all, and stored and coded in an integrated way that allows diverse academic users to access it, add to it, and analyze it from their own perspectives. In the current situation, different departments and units within universities collect and store pertinent data in different formats throughout the academic year. The procedure is time-consuming and often costly. It is a huge loss for education that very large sets of collected data are hardly analyzed and are not transformed into useful knowledge that could help address the challenges facing educational research in the 21st century.
In this context, and regardless of the uniqueness of each program within the university, this article aims to address the following questions:
1. Would it be feasible to develop an integrated data acquisition system for collecting and storing data from all departments and units within a university?
2. Can the collected data be converted to useful knowledge and provide new insights into educational research?
3. Can such practices be carried out without infringing on legal requirements relating to privacy and confidentiality?
Defining Data Mining and Knowledge Discovery in Data Models
Knowledge discovery in data. In a multifaceted environment in which data comes from different sources and in different shapes, the concepts of Knowledge Discovery in Data (KDD), data warehousing and data mining offer an alternative for learning from this data. These techniques are eclectic in nature and combine qualitative as well as quantitative research approaches; they also allow researchers to work with large amounts of data that are impacted by a large number of unknown variables. …