TDC Applied Data Analytics Training Program 2020
Program Content
The goal of the TANF Data Collaborative’s Applied Data Analytics program is to help professionals at TANF and related human services agencies develop key data science skills by working hands-on with confidential administrative datasets. The curriculum introduces the data science skills necessary to work with confidential data. It provides a mix of online materials that participants can review at their own pace, interactive lectures, prepared materials provided as Jupyter Notebooks, and guided project work in small groups. The curriculum draws on the second edition of Big Data and Social Science: A Practical Guide to Methods and Tools, which is available online at https://textbook.coleridgeinitiative.org/
The first training module runs from May 11 to June 5, the second from June 8 to June 19, and the third from September 21 to September 23. Final presentations take place on October 23.
Table of contents:
Training program
Module 1: May 11, 2020 - June 5, 2020 (Online)
Module 2: Monday, June 8 - Friday, June 19 (Online)
Module 2.5: Wednesday, July 22 (Online)
Module 3: Monday, September 21 - Wednesday, September 23
Project presentations - Final presentations October 23; reports due November 6
COMMUNICATIONS
We distribute information and communicate within the group via this website, e-mail, and Slack. In general, the instructor team will respond quickly to either e-mail or Slack messages; however, we tend to prefer Slack for technical issues and sharing snippets of information.
E-mail addresses
Group email account
training@coleridgeinitiative.org - the group email for program instructors and coordinators; general program, logistics and ADRF system questions should be directed here (ADRF is our computing platform, see the computing environment page for more info)
For ADRF tech support please email support@coleridgeinitiative.org
Report all security incidents or suspected incidents (e.g., lost passwords, improper or suspicious acts) related to ADRF to adrf-security@coleridgeinitiative.org
Instructor email addresses:
Julia Lane - julia.lane@nyu.edu
Brian Kim - kimbrian@umd.edu
Clayton Hunter - clayton.hunter@nyu.edu or clayton@coleridgeinitiative.org
Ekaterina Levitskaya - ekaterina.levitskaya@coleridgeinitiative.org
Benjamin Feder - ben.feder@coleridgeinitiative.org
Slack
We use Slack extensively, and it is often the best way to get in touch with us. The team workspace is coleridge-initiative.slack.com, and you will receive an invitation to join when the Online Introduction material begins (if you have not received one, please let us know). If you are unfamiliar with Slack, it uses "channels" to organize conversations. Previous classes have used Slack in various ways (e.g., a channel for Python-specific questions), but the two primary channels we expect to use are (i) the "tdc-2020" channel for general program discussion and sharing documents and (ii) the "adrf-tech-support" channel for any technical support in accessing the ADRF.
DATA DOCUMENTATION
The data providers, Coleridge team, and collaborators have created the below documentation for the datasets to be used in this program.
Indiana Department of Workforce Development: data dictionary (“in_dwd” schema)
Indiana Family and Social Services Administration (“in_fssa” schema)
Chapin Hall data model, created from the “tanf_adult”, “tanf_unioned_child”, and “tanf_family” raw tables
Additionally, the ADRF Explorer has dataset documentation: https://ds.adrf.cloud/projects/23 (you will need to log in with your ADRF credentials).
Lit review sources
F. Andersson, H. J. Holzer, J. I. Lane, Moving Up Or Moving On: Who Gets Ahead in the Low-Wage Labor Market? (Russell Sage Foundation, 2005).
F. Andersson, H. J. Holzer, J. Lane, in Studies of labor market intermediation (University of Chicago Press, 2009), pp. 373–398.
D. H. Autor, S. N. Houseman, Do temporary-help jobs improve labor market outcomes for low-skilled workers? Evidence from "Work First". Am. Econ. J. Appl. Econ. 2, 96–128 (2010).
“The Promise of Evidence-Based Policymaking: Report of the Commission on Evidence-Based Policymaking” (Washington, D.C.).
B. D. Meyer, J. X. Sullivan, “Measuring the well-being of the poor using income and consumption” (National Bureau of Economic Research, 2003).
B. D. Meyer, W. K. C. Mok, J. X. Sullivan, “The under-reporting of transfers in household surveys: its nature and consequences” (National Bureau of Economic Research, 2009).
David W. Stevens, 2007. "Employment that is not covered by state unemployment insurance Laws," Longitudinal Employer-Household Dynamics Technical Papers 2007-04, Center for Economic Studies, U.S. Census Bureau.
Additional report sources
Making Data Work for Families: Expanding impactful use of data with the Family Self-Sufficiency Data Center and TANF Data Innovations
https://www.chapinhall.org/project/making-data-work-for-families/
https://www.chapinhall.org/project/administrative-data-for-the-public-good/
https://www.chapinhall.org/project/studies-explore-use-of-administrative-data/
Module 1: Online Introduction Material
The Introduction to SQL and R is self-paced and delivered online, with weekly discussion sessions held via Zoom. You will not need to install any specific software or create an account on any technology platform to complete the introductory material. We expect the content to take no more than 4 hours per week. Brian Kim, the instructor for the introductory material, will provide details the week of May 4.
Note: as some participants were not able to use Binder, some of the Intro to SQL and R materials have been transferred into the ADRF and translated to run there. The ADRF versions of the introduction notebooks can be found in the “Ada-Tdc-2020” project’s shared folder in a folder named “Intro-to-SQL-and-R”.
Module 2: June 8-19 (11am - 2pm Eastern)
Zoom information:
URL link: https://nyu.zoom.us/j/99601784027?pwd=YURGOTdWYWZidS9DZUVQdzRZTTdPUT09
Meeting ID: 996 0178 4027
Meeting password: 197556
June 8 - Introduction, Motivation and Project Scoping (textbook chapter 1)
11:00-12:30: Introduction, Motivation, and Project Scoping (project template, Intro to ADRF videos, Day 1 slides)
12:30-1:00: Discussion session (first wave: CA, MI, MN, NJ, NY, VA)
1:00-1:05: Preview of day 2 (Daily feedback form)
1:05-1:35: Discussion session (second wave: CO and UT)
Discuss the labor market differences between 2016Q4 and 2009Q1. Review the literature and provide a summary of how employer activity is characterized.
June 9 - Dataset Exploration and Database Management (textbook chapter 2, textbook chapter 4)
11:00-11:55 Dataset Exploration and Database Management (Database pre-lecture video, TDC Data Structure pre-lecture videos; Day 2 slides)
11:55-12:55: Discussion session (first wave: CA, MI, MN, NJ, NY, VA)
12:55-1:00: Preview of day 3 (Daily feedback form)
1:00-1:55: Discussion session (second wave: CO and UT)
Discuss what demographic characteristics could be chosen to compare and contrast the two cohorts.
June 11 - Applications of Dataset Exploration
Pre-work on the ADRF: the “03_Data_Exploration” notebook, follow-up questions
11:00-11:55: Applications of Dataset Exploration (slides)
11:55-12:55: Discussion session (first wave: CA, MI, MN, NJ, NY, VA)
12:55-1:00: Preview of day 4 (Daily feedback form)
1:00-1:55: Discussion session (second wave: CO and UT)
Calculate how many individuals are in the wage record data in 2009Q1 and compare it to 2016Q4. Compare the number of individuals at the county level for the top 5 counties in each year.
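One way to approach this exercise is sketched below in plain Python. The records, the field layout (person_id, quarter, county), and the function names are hypothetical stand-ins for illustration only; they are not the ADRF schema, and in the program itself the same logic would more likely be written in SQL or R against the wage record tables.

```python
from collections import Counter

# Hypothetical wage records: (person_id, quarter, county).
# Made-up rows for illustration; not the real ADRF schema or data.
wage_records = [
    ("p1", "2009Q1", "Marion"),
    ("p1", "2009Q1", "Marion"),  # same person, second job in the quarter
    ("p2", "2009Q1", "Lake"),
    ("p3", "2016Q4", "Marion"),
    ("p4", "2016Q4", "Allen"),
    ("p5", "2016Q4", "Marion"),
]


def count_individuals(records, quarter):
    """Number of distinct individuals with a wage record in the quarter."""
    return len({pid for pid, q, _ in records if q == quarter})


def people_by_county(records, quarter):
    """Distinct individuals per county in the quarter."""
    seen = {(pid, county) for pid, q, county in records if q == quarter}
    return Counter(county for _, county in seen)


def top_counties(records, quarter, n=5):
    """The n counties with the most distinct individuals."""
    return people_by_county(records, quarter).most_common(n)


for quarter in ("2009Q1", "2016Q4"):
    print(quarter, count_individuals(wage_records, quarter),
          top_counties(wage_records, quarter))
```

Counting distinct person identifiers, rather than rows, matters here because one individual can have several wage records in the same quarter.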
June 12 - Basics of Data Visualization (chapter 6)
Pre-work: lecture videos and accompanying questions
11:00-11:55: Basics of Data Visualization (slides)
11:55-12:55: Discussion session (first wave: CA, MI, MN, NJ, NY, VA)
12:55-1:00: Preview of day 5 (Daily feedback form)
1:00-1:55: Discussion session (second wave: CO and UT)
Recreate the code to link the 2009Q1 cohort to UI wage records. Compare earnings outcomes to those for the 2016Q4 cohort:
What percentage of the cohort was employed during at least one quarter in the following year?
What are the quarterly earnings thresholds at the 1st, 10th, 25th, 50th, 75th, 90th, and 99th percentiles?
How does employment by county compare across the two cohorts? (report on top 5 counties in each cohort)
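The first two questions above can be sketched as follows. This is a minimal illustration under stated assumptions: the cohort, the earnings rows, and the function names are hypothetical, "employed" is taken to mean positive earnings in a quarter, and the "inclusive" percentile method is one of several reasonable conventions rather than the program's prescribed one.

```python
from statistics import quantiles

# Hypothetical cohort and quarterly earnings: (person_id, quarter, earnings).
# Made-up values for illustration only.
cohort = {"p1", "p2", "p3", "p4"}
earnings = [
    ("p1", "2009Q2", 1200.0),
    ("p1", "2009Q3", 1500.0),
    ("p2", "2009Q4", 300.0),
    ("p3", "2010Q1", 4200.0),
    ("p9", "2009Q2", 999.0),  # not in the cohort, so ignored
]
follow_up = {"2009Q2", "2009Q3", "2009Q4", "2010Q1"}


def share_employed(cohort, earnings, quarters):
    """Share of the cohort with positive earnings in at least one quarter."""
    employed = {pid for pid, q, amt in earnings
                if pid in cohort and q in quarters and amt > 0}
    return len(employed) / len(cohort)


def earnings_percentiles(cohort, earnings, quarters,
                         points=(1, 10, 25, 50, 75, 90, 99)):
    """Quarterly earnings thresholds at the requested percentiles."""
    values = [amt for pid, q, amt in earnings
              if pid in cohort and q in quarters and amt > 0]
    cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
    return {p: cuts[p - 1] for p in points}


print(share_employed(cohort, earnings, follow_up))
print(earnings_percentiles(cohort, earnings, follow_up))
```

Note that the denominator for the employment share is the whole cohort, including members who never appear in the wage records, while the percentiles are computed only over person-quarters with positive earnings.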
June 15 - Applications of Data Visualization
Pre-work on the ADRF: the “05_Data_Visualization” notebook and questions
11:00-11:30: Applications of Data Visualization (slides)
11:30-12:55: Discussion session (first wave: CA, MI, MN, NJ, NY, VA)
12:55-1:00: Preview of day 6 (Daily feedback form)
1:00-2:30: Discussion session (second wave: CO and UT)
Compare the quarterly earnings distribution of the 2009Q1 cohort with that of 2016Q4 using a histogram. Calculate the number of employers for each individual in each cohort and create a bar chart that compares the relative frequencies.
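The employer-count comparison can be sketched as below. The job rows, identifiers, and function name are hypothetical, and the sketch only produces the relative frequencies that a bar chart would then display; the plotting step itself is left to whichever tool the group is using.

```python
from collections import Counter

# Hypothetical (person_id, employer_id) pairs per cohort; made-up data.
jobs = {
    "2009Q1": [("p1", "e1"), ("p1", "e2"), ("p2", "e1"), ("p3", "e3")],
    "2016Q4": [("p4", "e1"), ("p5", "e2"), ("p5", "e2"),  # duplicate row
               ("p6", "e4"), ("p6", "e5"), ("p7", "e1")],
}


def employer_count_shares(pairs):
    """Map number-of-employers -> share of individuals with that many."""
    distinct_pairs = set(pairs)  # drop duplicate person-employer rows
    per_person = Counter(pid for pid, _ in distinct_pairs)
    freq = Counter(per_person.values())
    n_people = len(per_person)
    return {k: freq[k] / n_people for k in sorted(freq)}


for cohort_name, pairs in jobs.items():
    print(cohort_name, employer_count_shares(pairs))
```

Using relative rather than absolute frequencies keeps the two cohorts comparable even when they differ in size.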
June 16 - Record Linkage (chapter 3)
Pre-work: pre-lecture videos
11:00-11:55: Record Linkage (slides)
11:55-12:55: Discussion session (first wave: CA, MI, MN, NJ, NY, VA)
12:55-1:00: Preview of day 7 (Daily feedback form) - Blocking slides
1:00-1:55: Discussion session (second wave: CO and UT)
Create a heatmap of the employment by county for the 2009Q1 cohort and compare to the heatmap of employment by county for the 2016Q4 cohort.
June 18 - Applications of Record Linkage
Pre-work on the ADRF: the “07_Employer_Measures” notebook and questions
11:00-11:55: Applications of Record Linkage (slides)
11:55-12:55: Discussion session (first wave: CA, MI, MN, NJ, NY, VA)
12:55-1:00: Preview of day 8 (Daily feedback form)
1:00-1:55: Discussion session (second wave: CO and UT)
Review interim presentations for June 19
Compare the overall labor markets in the year after leaving TANF (2009Q2-2010Q1 versus 2017Q1-2017Q4).
How many employers were there in each quarter? Do both (1) for all workers and (2) for TANF recipients
What were the most common industries? Do both (1) for all workers and (2) for TANF recipients
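The two questions above share one pattern: compute a measure once for all workers and once restricted to TANF recipients. A minimal sketch of that pattern follows; the records, the is_tanf flag, the industry labels, and the function names are all hypothetical and do not reflect the actual tables.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (person_id, employer_id, quarter, industry, is_tanf).
# Made-up data for illustration only.
records = [
    ("p1", "e1", "2009Q2", "Retail", True),
    ("p2", "e1", "2009Q2", "Retail", False),
    ("p3", "e2", "2009Q2", "Health", False),
    ("p1", "e3", "2009Q3", "Food", True),
    ("p3", "e2", "2009Q3", "Health", False),
    ("p4", "e1", "2009Q3", "Retail", False),
]


def employers_per_quarter(records, tanf_only=False):
    """Number of distinct employers in each quarter."""
    employers = defaultdict(set)
    for _, employer, quarter, _, is_tanf in records:
        if is_tanf or not tanf_only:
            employers[quarter].add(employer)
    return {q: len(ids) for q, ids in sorted(employers.items())}


def top_industries(records, tanf_only=False, n=3):
    """Most common industries by number of job records."""
    counts = Counter(industry for _, _, _, industry, is_tanf in records
                     if is_tanf or not tanf_only)
    return counts.most_common(n)


for tanf_only in (False, True):
    label = "TANF recipients" if tanf_only else "all workers"
    print(label, employers_per_quarter(records, tanf_only),
          top_industries(records, tanf_only))
```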
June 19 - Module 2 Presentations
Please fill out this feedback form for each team’s presentation
11:00-12:30: Module 2 Presentations (slides)
Present what has been accomplished thus far, following the project template, and include which skills and ideas could be useful in your TDC Pilot project.
Module 2.5: July 22 (11am - 2pm Eastern)
Pre-work
Machine Learning videos and questions (in particular the “intro to ML” and “clustering” videos and questions)
Notebook on the ADRF: 09_ML_Clustering
Zoom information:
URL link: https://nyu.zoom.us/j/99601784027?pwd=YURGOTdWYWZidS9DZUVQdzRZTTdPUT09
Meeting ID: 996 0178 4027
Meeting password: 197556
July 22 - Introduction to Machine Learning (chapter 7)
11:00-11:55: Plenary session (slides)
12:00-12:55: Discussion session (first wave: CA, MI, MN, NJ, NY, VA)
12:55-1:00: Reconvene full group (Daily feedback form)
1:05-2:00: Discussion session (second wave: CO and UT)
Discussion focus: ADA project check-in and an initial brainstorm of how clustering could/does apply to your group’s project
Module 3: September 21-23 (11am - 2pm, 3:00pm - 4:30pm Eastern)
Zoom information:
URL link: https://coleridgeinitiative-org.zoom.us/j/98784278094?pwd=eTBMUHVFbTBNak9SNjRGWFlUMEVaZz09
Meeting ID: 987 8427 8094
Meeting password: 266290
September 21 - Interim Presentations and Machine Learning, continued (chapter 7)
Pre-work - Review notebook on the ADRF (shared folder —> Module-2-Notebooks/tdc_July-7: 09_ML_Clustering)
11:00-11:15: Opening Remarks (slides)
11:15-12:30: Interim Presentations - please fill out the feedback form for each team
3:00-3:30: Applications of Machine Learning plenary
3:30-4:00: Discussion session (first wave: CA, MI, MN, NJ, NY, VA)
Discussion focus: Begin implementing feedback from interim presentation. (Daily feedback form)
4:00-4:30: Discussion session (second wave: CO and UT)
September 22 - Inference (chapter 10, chapter 11)
Pre-work - Errors and Inference videos and questions, Notebook on the ADRF: 11_Inference_Imputation
11:00-11:55: Plenary session (slides, Daily feedback form)
Discussion focus: Identify caveats and imputation potential for group project.
12:00-1:00: Discussion session (first wave: CA, MI, MN, NJ, NY, VA)
1:00-2:00: Discussion session (second wave: CO and UT)
3:00-3:30: Applications of Inference (slides)
3:30-4:00: Discussion session (first wave: CA, MI, MN, NJ, NY, VA)
4:00-4:30: Discussion session (second wave: CO and UT)
September 23 - Privacy, Confidentiality, and Ethics (chapter 12)
Pre-work - Privacy and Confidentiality videos and questions, Notebook on the ADRF: 12_Disclosure_Review
11:00-11:55: Plenary session (slides, Daily feedback form)
Discussion focus: Discuss outstanding analyses to complete for final presentation.
12:00-1:00: Discussion session (first wave: CA, MI, MN, NJ, NY, VA)
1:00-2:00: Discussion session (second wave: CO and UT)
3:00-3:30: Submitting an Export plenary
Discussion focus: Begin preparing work for exporting and map out timeline for all exports.
3:30-4:00: Discussion session (first wave: CA, MI, MN, NJ, NY, VA)
4:00-4:30: Discussion session (second wave: CO and UT)
Projects - Final presentations October 23
Final presentations are expected to be 20 minutes long followed by 10 minutes of feedback from the instructor group. Each participant should provide brief feedback for all other teams’ presentations via this form.
Final project reports are due by the close of business on November 6.
Presentation schedule
12:00pm to 1:30pm Eastern: session 1
12:00 - 12:30: Virginia (Final Report)
12:30 - 1:00: Minnesota (Final Report)
1:00 - 1:30: New Jersey (Final Report)
1:30pm to 1:45pm Eastern: break
1:45pm to 3:15pm Eastern: session 2
1:45 - 2:15: Michigan (Final Report)
2:15 - 2:45: New York (Final Report)
2:45 - 3:15: Colorado (Final Report)
3:15pm to 3:30pm Eastern: break
3:30pm to 4:30pm Eastern: session 3
3:30 - 4:00: Utah (Final Report)
4:00 - 4:30: California (Final Report)
Videoconference
Join Zoom Meeting
https://coleridgeinitiative-org.zoom.us/j/93928104335?pwd=V3pPbFgxNnR5K0dNZHRqT3Fxb1hoUT09
Meeting ID: 939 2810 4335
Passcode: 792157