Friday, July 20, 2012

Is education a data-intensive science? Should it be?

Can collecting and sorting through massive amounts of learner data improve education? It's par for the course in some other fields. If you still think of astronomers scanning the skies, peering through enormous telescopes on mountaintop observatories, or studying digital pictures coming back from the Hubble telescope, think again.

"People now do not actually look through telescopes. Instead, they are "looking" through large-scale, complex instruments which relay data to datacenters, and only then do they look at the information on their computers." Tony Hey et al, The Fourth Paradigm

Astronomy is a data-intensive science. Defined loosely, that means that there's just way too much data even for scientists to manage. Used to be, scientists had to carefully craft an experiment in order to generate data. Their job was to create precise, specific data that would then be carefully added to someone else's precise, specific data, and that would go on in an additive fashion for a while, and eventually, hey! A conclusion could be drawn.

Not anymore. In many fields today, scientists are simply awash with the stuff. It's a tsunami of data, it's... more data than any one metaphor can hold. It's downloaded from instruments, from automated inputs, millions of nodes, devices, all connected to computer databases, or even generated by computer networks themselves.

Once it's set up and put in motion, all this information is collected automatically, sometimes at terabytes per second. CERN's supercollider generates over 1,000 terabytes of data per second, which is a petabyte every second. To give you an idea how much that is, it takes more than 400 high-definition movies to add up to 1 terabyte, and we're talking a thousand times that much... Imagine 400,000 HD movies generated out of thin air every second. Store that on your DVR.
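If you want to check the back-of-the-envelope math, here is a minimal sketch in Python. The two constants are just the rounded figures quoted above, and the function name is made up for illustration; nothing here comes from CERN itself.

```python
# Figures taken from the paragraph above; both are rough, rounded estimates.
MOVIES_PER_TERABYTE = 400      # "more than 400 high-definition movies" per terabyte
TERABYTES_PER_SECOND = 1_000   # "over 1,000 terabytes of data per second"


def movies_per_second(terabytes_per_second: int, movies_per_terabyte: int) -> int:
    """Convert a data rate in TB/s into an equivalent number of HD movies per second."""
    return terabytes_per_second * movies_per_terabyte


if __name__ == "__main__":
    rate = movies_per_second(TERABYTES_PER_SECOND, MOVIES_PER_TERABYTE)
    print(f"{rate:,} HD movies per second")  # prints "400,000 HD movies per second"
```

Since both inputs are "more than" figures, the result is best read as a floor, not an exact count.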


What do they do with all the data? Well, they dump it. Boatloads of perfectly good data are erased every day, by really good scientists, too, because they just can't keep it. They can only keep the final tallies, the results, the reports. There's a whole science now developing across these fields that is just about how to manage data, called "e-science." Microsoft gives an award for advancing it. Alexander Szalay of Johns Hopkins won it in 2007.

To try to get a feel for the magnitude of this, imagine so many Major League Baseball teams, and so many games being played every day, that you couldn't keep all the recordings of them anywhere. The evening sportscast would go like this: "And in baseball, there were ten trillion National League games played today, and here are the scores: the winners trended toward the home teams once again, by a 54 percent margin. Congratulations, home teams!" Want to watch some highlights? Sorry, they're gone forever... the recording was erased the instant the last out was made.
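For the programmers in the audience, the "keep the tallies, dump the recordings" idea can be sketched in a few lines of Python. This is only an illustration of a running summary under my own assumptions: the class name, the function name, and the scores in the usage note are all made up, and real e-science pipelines are far more involved.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class HomeWinTally:
    """Running summary of game results; no individual game is stored."""
    games: int = 0
    home_wins: int = 0

    def record(self, home_score: int, away_score: int) -> None:
        # Update the counters, then let the raw result go.
        self.games += 1
        if home_score > away_score:
            self.home_wins += 1

    def home_win_rate(self) -> float:
        # Fraction of recorded games won by the home team (0.0 if none yet).
        return self.home_wins / self.games if self.games else 0.0


def summarize(results: Iterable[Tuple[int, int]]) -> HomeWinTally:
    """Consume (home_score, away_score) pairs one at a time, keeping only the tally."""
    tally = HomeWinTally()
    for home_score, away_score in results:
        tally.record(home_score, away_score)
    return tally
```

Feed it a stream such as `summarize(iter([(5, 3), (2, 4), (7, 1), (3, 6)]))` (invented scores) and you get back a tally of 4 games and 2 home wins. Memory use stays constant no matter how many games go by, which is the whole point: the summary survives, the highlights don't.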

Is education a data-intensive science? No. But it could be. The amount of data generated by educational activities every day is stunning, but only a small fraction of it is captured. How much communications data is generated every day? Just think about the phone minutes alone, being tracked and logged and billed globally. And education as an industry is about three times the size of both the entertainment and communications industries. It's bigger than both combined. Those industries have gotten very good at collecting data, all kinds of it, and using it. They use it for billing, marketing, product improvement, service improvement, competitive analysis, pricing, investment...

Education and training? We're not so good at collecting data. But we are getting better. The data generated by learning is increasingly being collected digitally and used for similar purposes, which may be good or bad, but is likely inevitable. It's also being mined for loftier reasons. Arizona State University now uses data mining techniques to create student profiles that help guide students through their college careers. Check out this article from the Chronicle of Higher Education.

But ASU is just scratching the surface. A larger and larger percentage of the learning being done in schools, universities, and the workplace is being done using digital tools... online courses, digital learning objects, digital textbooks, simulations. Final grades are a tiny percentage of the data that is already being collected. The rest is learner data that can be, and is being, collected during the trial and error of the actual learning experience: the clicks and drags and searches and reviews and posts and responses and ratings and submissions.

The global push toward measuring learning outcomes, toward judging the quality of education by what students can or can't do, what they do and do not achieve... that cannot now be disconnected, never again will be disconnected, from tracking and progress data. The era of a final, a mid-term, six quizzes and a paper adding up to the sum total of a student's data? That's over. Collecting and using data to measure actual learning success, or lack of it, is exactly what makes high-profile efforts like Khan Academy work.

Should education be a data-intensive science? Moot question. It almost certainly will be. Better question: how do we make it as good as it can possibly be?