Data Sets

One way to polish your skills as a researcher is to expose yourself to many different data sets. Each data set has its own structure and peculiarities, and practicing data cleaning and analysis on many of them will broaden your capacity to work with data in any form. Such skills include merging files, reorganizing disparate data structures, importing different file types, processing missing data, identifying incorrectly coded information, and extracting relevant data. Below is a list of data sets that are not only good examples of realistic data, but also ones you will probably work with during your academic career. Each is openly accessible at least in part; the level of access is noted in each entry. If you have a suggestion for another data set that belongs on this page, please send it to stathelp@gse.harvard.edu.
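As a small illustration of two of the skills listed above (merging files and processing missing data), here is a minimal sketch using only the Python standard library. The two tables, the column names, the school IDs, and the use of -9 as a "not reported" code are all invented for this example; real files document their own missing-value codes, so check the codebook before recoding anything.

```python
import csv
import io

# Hypothetical extracts: the column names, IDs, and the -9 missing code
# are invented for illustration and do not come from any real file.
ENROLLMENT_CSV = """school_id,enrollment
A001,412
A002,-9
A003,655
"""

FINANCE_CSV = """school_id,per_pupil_spending
A001,10850
A003,-9
A004,9720
"""

MISSING_CODES = {"-9", ""}  # assumed sentinel values for "not reported"


def read_table(text, value_column):
    """Return {school_id: value or None}, recoding sentinels to None."""
    table = {}
    for row in csv.DictReader(io.StringIO(text)):
        raw = row[value_column].strip()
        table[row["school_id"]] = None if raw in MISSING_CODES else int(raw)
    return table


def outer_merge(left, right):
    """Merge two tables on school_id, keeping IDs found in either one."""
    merged = {}
    for school_id in sorted(set(left) | set(right)):
        # .get() yields None when a school is absent from one table.
        merged[school_id] = (left.get(school_id), right.get(school_id))
    return merged


enrollment = read_table(ENROLLMENT_CSV, "enrollment")
finance = read_table(FINANCE_CSV, "per_pupil_spending")
merged = outer_merge(enrollment, finance)

# Complete cases: schools with both values present.
complete = {k: v for k, v in merged.items() if None not in v}

for school_id, (enrolled, spending) in merged.items():
    print(school_id, enrolled, spending)
```

Note that this sketch does not distinguish a value that was coded as missing from a school that is absent from one file altogether. In a real analysis that difference often matters and is worth tracking separately.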

CCD: https://nces.ed.gov/ccd/ccddata.asp
The CCD administers a survey every year to all public schools in the country, gathering information on demographics, finances, and other school characteristics at various levels. The CCD also administers a private school survey every other year with more limited information. This data set is particularly relevant when the unit of analysis is the school, school district, or state. Note that file type availability and survey questions are somewhat inconsistent across years. All data is openly accessible.

NAEP: https://nces.ed.gov/nationsreportcard/researchcenter/datatools.aspx
Nicknamed "the nation's report card," the NAEP has provided the backbone of national educational research by administering a national test since 1969. Rather than being given to everybody, the test is given to intentionally sampled schools, and the results are then carefully weighted to represent the nation as a whole. The NAEP serves as the gold standard of academic achievement data for investigations of national trends, as well as for state-level comparisons. (NAEP data is the foundation of the SEDA data set; see below.) National and state level data are openly accessible from 1990 onwards, and some data prior to 1990 is available with basic permissions. Micro-level data requires restricted access.
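To make the idea of weighting a sample concrete, here is a minimal sketch of a weighted mean in Python. The scores and weights are made-up numbers, and the function name is hypothetical. This shows only the basic principle and is not a reproduction of NAEP's actual estimation procedure, which is more involved and is described in NAEP's own technical documentation.

```python
def weighted_mean(values, weights):
    """Average of values where each one counts in proportion to its weight."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Made-up example: three sampled schools with average scores, where each
# weight is the number of students in the population the school stands for.
scores = [250.0, 270.0, 290.0]
weights = [300.0, 100.0, 100.0]

unweighted = sum(scores) / len(scores)
weighted = weighted_mean(scores, weights)

print(unweighted)  # 270.0
print(weighted)    # 262.0
```

Because the lowest-scoring school in this example represents the most students, the weighted estimate falls below the simple average. Ignoring the weights would overstate the population mean here.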

NHGIS: https://www.nhgis.org/
IPUMS curates the largest database of census microdata, and the NHGIS provides pre-organized data sets and tables through an easy-to-use interface. You can specify data sets, topics, levels of geography, and/or years of interest, and the data finder will compile all the relevant data, merged and collapsed, into a single data set for you. This is a fantastic resource when the unit of analysis is a geographic area, such as a neighborhood or a census tract. All data is openly accessible.

SEDA: https://cepa.stanford.edu/seda/data-archive
SEDA is an answer to a glaring issue in US education research: our "standardized" tests are usually only standard within a state or a district; otherwise, they're all different. This makes comparisons across, say, district lines exceedingly difficult, as the standards of proficiency in one district may not match those in another. SEDA chains together results on different exams via commonly administered exams (like NAEP) and puts those scores on the same scale for multiple subjects, for multiple demographics, and for two grades, 4th and 8th. District level data sets are openly accessible, and school level data sets are available with special permission. For national and state level data, consider using NAEP (see above) directly.
