TDC Applied Data Analytics Training Program 2020



Program Content

The goal of the TANF Data Collaborative’s Applied Data Analytics program is to help professionals at TANF and related human services agencies develop key data science skills by working hands-on with confidential administrative datasets. The curriculum is designed to introduce the data science skills necessary to work with confidential data. It provides a mix of online materials that participants can review at their own pace, interactive lectures, prepared materials provided in Jupyter Notebooks, and guided project work in small groups. The curriculum draws on the second edition of Big Data and Social Science: A Practical Guide to Methods and Tools, which is available online at https://textbook.coleridgeinitiative.org/

The first training module will begin May 11 and end June 5. The second training module will take place June 8-19 and the third on September 21-23. Final presentations will be on October 23.

Project Template

COMMUNICATIONS

We distribute information and communicate within the group via this website, e-mail, and Slack. In general, the instructor team will respond quickly to either e-mail or Slack messages; however, we tend to prefer Slack for technical issues and sharing snippets of information.

E-mail addresses

Slack

We use Slack extensively, and it is often the best way to get in touch with us. The workspace is coleridge-initiative.slack.com, and you will receive an invitation to join when the Online Introduction material begins (if you have not received one, please let us know). If you are unfamiliar with Slack, it uses "channels" to help organize conversations. Previous classes have used Slack in various ways (e.g., a channel for Python-specific questions), but the two primary channels we expect to use are (i) the “tdc-2020” channel for general program discussion and sharing documents and (ii) the "adrf-tech-support" channel for technical support in accessing the ADRF.

DATA DOCUMENTATION

The data providers, the Coleridge team, and collaborators have created the documentation below for the datasets to be used in this program.

Additionally, the ADRF Explorer has dataset documentation: https://ds.adrf.cloud/projects/23 (you will need to log in with your ADRF credentials).

Lit review sources

  • F. Andersson, H. J. Holzer, J. I. Lane, Moving Up Or Moving On: Who Gets Ahead in the Low-Wage Labor Market? (Russell Sage Foundation, 2005).

  • F. Andersson, H. J. Holzer, J. Lane, in Studies of labor market intermediation (University of Chicago Press, 2009), pp. 373–398.

  • D. H. Autor, S. N. Houseman, Do temporary-help jobs improve labor market outcomes for low-skilled workers? Evidence from "Work First". Am. Econ. J. Appl. Econ. 2, 96–128 (2010).

  • “The Promise of Evidence-Based Policymaking: Report of the Commission on Evidence-Based Policymaking” (Washington, D.C.).

  • B. D. Meyer, J. X. Sullivan, “Measuring the well-being of the poor using income and consumption” (National Bureau of Economic Research, 2003).

  • B. D. Meyer, W. K. C. Mok, J. X. Sullivan, “The under-reporting of transfers in household surveys: its nature and consequences” (National Bureau of Economic Research, 2009).

  • David W. Stevens, 2007. "Employment that is not covered by state unemployment insurance Laws," Longitudinal Employer-Household Dynamics Technical Papers 2007-04, Center for Economic Studies, U.S. Census Bureau.

Additional report sources

Module 1: Online Introduction Material

The Introduction to SQL and R is self-paced and delivered online, with weekly discussion sessions held via Zoom. You will not need to install any specific software or create an account on any technology platform to complete the introductory material. We expect the content to take no more than 4 hours per week. Brian Kim, the instructor for the introductory material, will provide details the week of May 4.

Note: as some participants were not able to use Binder, some of the Intro to SQL and R materials have been transferred into the ADRF and translated to run there. The ADRF versions of the introduction notebooks can be found in the “Ada-Tdc-2020” project’s shared folder in a folder named “Intro-to-SQL-and-R”.

Module 2: June 8-19 (11am - 2pm Eastern)

Zoom information:

  • URL link: https://nyu.zoom.us/j/99601784027?pwd=YURGOTdWYWZidS9DZUVQdzRZTTdPUT09

  • Meeting ID: 996 0178 4027

  • Meeting password: 197556

June 8 - Introduction, Motivation and Project Scoping (textbook chapter 1)

  • 11:00-12:30: Introduction, Motivation, and Project Scoping (project template, Intro to ADRF videos, Day 1 slides)

  • 12:30-1:00: Discussion session (first wave: CA, MI, MN, NJ, NY, VA)

  • 1:00-1:05: Preview of day 2 (Daily feedback form)

  • 1:05-1:35: Discussion session (second wave: CO and UT)

    • Discuss the labor market differences between 2016Q4 and 2009Q1. Review the literature and provide a summary of how employer activity is characterized.

June 9 - Dataset Exploration and Database Management (textbook chapter 2, textbook chapter 4)

  • 11:00-11:55: Dataset Exploration and Database Management (Database pre-lecture video, TDC Data Structure pre-lecture videos, Day 2 slides)

  • 11:55-12:55: Discussion session (first wave: CA, MI, MN, NJ, NY, VA)

  • 12:55-1:00: Preview of day 3 (Daily feedback form)

  • 1:00-1:55: Discussion session (second wave: CO and UT)

    • Discuss what demographic characteristics could be chosen to compare and contrast the two cohorts. 

June 11 - Applications of Dataset Exploration

Pre-work on the ADRF: the “03_Data_Exploration” notebook, follow-up questions

  • 11:00-11:55: Applications of Dataset Exploration (slides)

  • 11:55-12:55: Discussion session (first wave: CA, MI, MN, NJ, NY, VA)

  • 12:55-1:00: Preview of day 4 (Daily feedback form)

  • 1:00-1:55: Discussion session (second wave: CO and UT)

    • Calculate how many individuals are in the wage record data in 2009Q1 and compare it to 2016Q4. Compare the number of individuals at the county level for the top 5 counties in each year.
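
One way to approach the counting task above is sketched here in Python. The records, field layout, and function names (`wage_records`, `count_individuals`, `top_counties`) are hypothetical stand-ins, not the actual ADRF tables; the same logic can be expressed in SQL or R inside the ADRF.

```python
from collections import Counter

# Hypothetical toy wage records: (person_id, quarter, county).
# The real ADRF tables and column names will differ.
wage_records = [
    ("p1", "2009Q1", "Adams"),
    ("p1", "2009Q1", "Adams"),  # second job, same person and quarter
    ("p2", "2009Q1", "Baker"),
    ("p3", "2009Q1", "Adams"),
    ("p1", "2016Q4", "Baker"),
    ("p4", "2016Q4", "Clark"),
]


def count_individuals(records, quarter):
    """Number of distinct people with at least one wage record in the quarter."""
    return len({pid for pid, qtr, _ in records if qtr == quarter})


def top_counties(records, quarter, n=5):
    """Counties ranked by number of distinct people in the quarter."""
    # De-duplicate (person, county) pairs so multiple jobs are not double counted.
    person_county = {(pid, county) for pid, qtr, county in records if qtr == quarter}
    return Counter(county for _, county in person_county).most_common(n)


for q in ("2009Q1", "2016Q4"):
    print(q, count_individuals(wage_records, q), top_counties(wage_records, q))
```

The key design choice is counting distinct individuals rather than rows, since a person with several jobs in a quarter appears more than once in wage record data.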

June 12 - Basics of Data Visualization (chapter 6)

Pre-work: lecture videos and accompanying questions

  • 11:00-11:55: Basics of Data Visualization (slides)

  • 11:55-12:55: Discussion session (first wave: CA, MI, MN, NJ, NY, VA)

  • 12:55-1:00: Preview of day 5 (Daily feedback form)

  • 1:00-1:55: Discussion session (second wave: CO and UT)

    • Recreate the code to link the 2009Q1 cohort to UI wage records. Compare earnings outcomes to those for the 2016Q4 cohort:

      • What percentage of the cohort was employed during at least one quarter in the following year?

      • What are the quarterly earnings thresholds at the 1st, 10th, 25th, 50th, 75th, 90th, and 99th percentiles?

      • How does employment by county compare across the two cohorts? (report on top 5 counties in each cohort)
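
The first two questions above can be sketched in Python as follows. This is a minimal illustration on made-up data: the cohort, the records, and the names `employment_rate` and `earnings_percentiles` are assumptions, and treating "employed" as having positive earnings in at least one quarter is one possible definition, not necessarily the one used in the course notebooks.

```python
import statistics

# Hypothetical follow-up-year wage records: (person_id, quarter, quarterly_earnings).
# Real UI wage records in the ADRF will have different names and structure.
wage_records = [
    ("p1", "2017Q1", 1200.0),
    ("p1", "2017Q2", 1500.0),
    ("p2", "2017Q3", 300.0),
    ("p4", "2017Q1", 4200.0),
    ("p4", "2017Q4", 4100.0),
]
cohort = {"p1", "p2", "p3", "p4"}  # p3 has no wage record in the follow-up year


def employment_rate(cohort_ids, records):
    """Share of the cohort with positive earnings in at least one quarter."""
    employed = {pid for pid, _, earn in records if pid in cohort_ids and earn > 0}
    return len(employed) / len(cohort_ids)


def earnings_percentiles(records, points=(1, 10, 25, 50, 75, 90, 99)):
    """Quarterly earnings thresholds at the requested percentiles."""
    earnings = [earn for _, _, earn in records]
    # n=100 yields 99 cut points; cut point p-1 is the p-th percentile.
    cuts = statistics.quantiles(earnings, n=100, method="inclusive")
    return {p: cuts[p - 1] for p in points}


print(f"Employed in at least one quarter: {employment_rate(cohort, wage_records):.0%}")
print(earnings_percentiles(wage_records))
```

Note that the denominator is the full cohort, including members who never appear in the wage records, so the linkage must keep unmatched individuals.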

June 15 - Applications of Data Visualization

Pre-work on the ADRF: the “05_Data_Visualization” notebook and questions

  • 11:00-11:30: Applications of Data Visualization (slides)

  • 11:30-12:55: Discussion session (first wave: CA, MI, MN, NJ, NY, VA)

  • 12:55-1:00: Preview of day 6 (Daily feedback form)

  • 1:00-2:30: Discussion session (second wave: CO and UT)

    • Compare the quarterly earnings distribution of the 2009Q1 cohort with that of 2016Q4 using a histogram. Calculate the number of employers for each individual in each cohort and create a bar chart that compares the relative frequencies.
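
The employer-count part of this task could be prepared along the following lines before plotting. The sketch uses invented (person, employer) pairs and hypothetical function names; it only computes the relative frequencies that a bar chart would display and leaves the plotting itself to whichever library the notebooks use.

```python
from collections import Counter

# Hypothetical (person_id, employer_id) pairs for one cohort.
# Repeated pairs represent several quarters with the same employer.
jobs = [
    ("a", "e1"), ("a", "e1"), ("a", "e2"),
    ("b", "e1"),
    ("c", "e3"),
    ("d", "e3"), ("d", "e4"),
]


def employers_per_person(pairs):
    """Number of distinct employers for each individual."""
    by_person = {}
    for pid, employer in pairs:
        by_person.setdefault(pid, set()).add(employer)
    return {pid: len(employers) for pid, employers in by_person.items()}


def relative_frequencies(counts):
    """Share of individuals having 1, 2, 3, ... distinct employers."""
    tally = Counter(counts.values())
    total = sum(tally.values())
    return {k: tally[k] / total for k in sorted(tally)}


print(relative_frequencies(employers_per_person(jobs)))
```

Using relative rather than absolute frequencies keeps the two cohorts comparable when they differ in size.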

June 16 - Record Linkage (chapter 3)

Pre-work: pre-lecture videos

  • 11:00-11:55: Record Linkage (slides)

  • 11:55-12:55: Discussion session (first wave: CA, MI, MN, NJ, NY, VA)

  • 12:55-1:00: Preview of day 7 (Daily feedback form) - Blocking slides

  • 1:00-1:55: Discussion session (second wave: CO and UT)

    • Create a heatmap of the employment by county for the 2009Q1 cohort and compare to the heatmap of employment by county for the 2016Q4 cohort.

June 18 - Applications of Record Linkage

Pre-work on the ADRF: the “07_Employer_Measures” notebook and questions

  • 11:00-11:55: Applications of Record Linkage (slides)

  • 11:55-12:55: Discussion session (first wave: CA, MI, MN, NJ, NY, VA)

  • 12:55-1:00: Preview of day 8 (Daily feedback form)

  • 1:00-1:55: Discussion session (second wave: CO and UT)

    • Review interim presentations for June 19

    • Compare the overall labor markets in the year after leaving TANF (2009Q2-2010Q1 versus 2017Q1-2017Q4).

      • How many employers were there in each quarter? Answer for both (1) all workers and (2) TANF recipients.

      • What were the most common industries? Answer for both (1) all workers and (2) TANF recipients.

June 19 - Module 2 Presentations

Please fill out this feedback form for each team’s presentation.

  • 11:00-12:30: Module 2 Presentations (slides)

    • Present what has been accomplished thus far, following the project template, and include which skills and ideas can be useful in your TDC Pilot project.

Module 2.5: July 22 (11am - 2pm Eastern)

Pre-work

Zoom information:

  • URL link: https://nyu.zoom.us/j/99601784027?pwd=YURGOTdWYWZidS9DZUVQdzRZTTdPUT09

  • Meeting ID: 996 0178 4027

  • Meeting password: 197556

July 22 - Introduction to Machine Learning (chapter 7)

  • 11:00-11:55: Plenary session (slides)

  • 12:00-12:55: Discussion session (first wave: CA, MI, MN, NJ, NY, VA)

  • 12:55-1:00: Reconvene full group (Daily feedback form)

  • 1:05-2:00: Discussion session (second wave: CO and UT)

    • Discussion focus: ADA project check-in and an initial brainstorm of how clustering could/does apply to your group’s project

Module 3: September 21-23 (11am - 2pm, 3:00pm - 4:30pm Eastern)

Zoom information:

  • URL link: https://coleridgeinitiative-org.zoom.us/j/98784278094?pwd=eTBMUHVFbTBNak9SNjRGWFlUMEVaZz09

  • Meeting ID: 987 8427 8094

  • Meeting password: 266290

September 21 - Interim Presentations and Machine Learning, continued (chapter 7)

Pre-work - Review notebook on the ADRF (shared folder —> Module-2-Notebooks/tdc_July-7: 09_ML_Clustering)

  • 11:00-11:15: Opening Remarks (slides)

  • 11:15-12:30: Interim Presentations - please fill out the feedback form for each team

  • 3:00-3:30: Applications of Machine Learning plenary

  • 3:30-4:00: Discussion session (first wave: CA, MI, MN, NJ, NY, VA)

  • 4:00-4:30: Discussion session (second wave: CO and UT)

September 22 - Inference (chapter 10, chapter 11)

Pre-work - Errors and Inference videos and questions, Notebook on the ADRF: 11_Inference_Imputation

  • 11:00-11:55: Plenary session (slides, Daily feedback form)

    • Discussion focus: Identify caveats and imputation potential for group project.

  • 12:00-1:00: Discussion session (first wave: CA, MI, MN, NJ, NY, VA)

  • 1:00-2:00: Discussion session (second wave: CO and UT)

  • 3:00-3:30: Applications of Inference (slides)

  • 3:30-4:00: Discussion session (first wave: CA, MI, MN, NJ, NY, VA)

  • 4:00-4:30: Discussion session (second wave: CO and UT)

September 23 - Privacy, Confidentiality, and Ethics (chapter 12)

Pre-work - Privacy and Confidentiality videos and questions, Notebook on the ADRF: 12_Disclosure_Review

  • 11:00-11:55: Plenary session (slides, Daily feedback form)

    • Discussion focus: Discuss outstanding analyses to complete for final presentation.

  • 12:00-1:00: Discussion session (first wave: CA, MI, MN, NJ, NY, VA)

  • 1:00-2:00: Discussion session (second wave: CO and UT)

  • 3:00-3:30: Submitting an Export plenary

    • Discussion focus: Begin preparing work for exporting and map out timeline for all exports.

  • 3:30-4:00: Discussion session (first wave: CA, MI, MN, NJ, NY, VA)

  • 4:00-4:30: Discussion session (second wave: CO and UT)

Projects - Final Presentations October 23

Final presentations are expected to be 20 minutes long, followed by 10 minutes of feedback from the instructor group. Each participant should provide brief feedback on all other teams’ presentations via this form.

Final project reports are due by the close of business on November 6.

Presentation schedule

Videoconference

Join Zoom Meeting
https://coleridgeinitiative-org.zoom.us/j/93928104335?pwd=V3pPbFgxNnR5K0dNZHRqT3Fxb1hoUT09
Meeting ID: 939 2810 4335
Passcode: 792157