Abstract
This project is meant to show the effects of Financial Aid within New York. Using visualizations and research, this project hopes to identify which demographics apply for Financial Aid and correlations within the data.
Overview
This project was created with the idea to research Financial Aid within New York City. My original hypothesis was that Financial Aid would prove to be necessary with lower income areas and minority groups. However, FAFSA demographic data does not exist specifically for States, except with the number of independent and dependent students there are.
From FAFSA's data, we can find the number of applications submitted per Quarter every year. I got the idea to calculate the rate of these submissions for NYC high schools. To achieve this, we can divide number of submissions with the enrollment of 12th graders for each high school, provided by NYS Education Department. Using pandas, pandasql and matplotlib, we can visualize the average rate per Borough. Computing the rates allowed additional data to be added, like income data, helping identify some correlations.
Data
http://www.p12.nysed.gov/irs/statistics/enroll-n-staff/home.html
​
NYS Education Department gives us Enrollment data for NY State Public Schools. This data was used to get the number of students in 12th grade for each high school in NYC by merging it with FAFSA's data.
​
https://studentaid.gov/data-center/student/application-volume/fafsa-completion-high-school
​
Federal Student Aid gives us the FAFSA data. These datasets are the number of applications that are submitted and completed. They are broken down by year and by quarter. Within each gives you a list of all high schools in the nation and the number of submitted and completed. Federal Student Aid also grants the option to select by State, which was what was chosen for this project. However, this option only gives you the most recent quarters. Previous years must be obtained by looking through data with all the schools in the nation. Since this project only is looking at NYC, I chose the most recent quarters for NY state.
​
https://studentaid.gov/data-center/student/application-volume/fafsa-school-state
​
These datasets hold demographic data for FAFSA applicants by year. Since this data is for the whole country, I could not use those. This project is using the dataset for NY state for 2020-2021 quarter 3. This dataset shows the independent and dependent applications for that quarter, as well as the total number for the year to date. With this data, you can see the number of submission for the specific time frame, for example April 2020 - June 2020, and the total number of independents and dependents for the year so far.
​
https://data.cccnewyork.org/data/download#0,8/66
​
Citizen's Committee for Children of New York offers multiple datasets related to NYC. For this project, we are using Median Income in NYC. This dataset gives us median income for every Borough based on the household type: Families without Children, Families with Children, Families or All Households.
​
https://data.cityofnewyork.us/City-Government/Borough-Boundaries/tqmj-j8zm
​
NYC Open Data provides different NYC boundaries to export as a GeoJSON and draw on a map. This project is working with Boroughs, so we are exporting the Borough Boundaries to create the Boroughs on the map.
Map of Average Completion Rate per Borough June 2020 (Q3)
This map illustrates the average Completion Rate per Borough for the third Quarter. From this, we can see Manhattan and Queens have the highest Rates of Completion, while Bronx has the least.
The purpose of analyzing Completion Rates for FAFSA Applications is to record the demand for Federal Aid.
Significance of Completion Rates
After cleaning the FAFSA Completed Applications data, I was able to merge it with NYC High School Enrollment data and calculate the completion rates per NYC High School, giving us a sample of most NYC High Schools.
Map of Median Income per Borough in 2019
On the graph, we notice the plots clustering. However, there are also a few outliers. While investigating the outliers, I found schools with the highest rate Brooklyn Frontiers, had an overwhelmingly small population of students. Schools with the lowest, such as John F Kennedy, had less than 5 FAFSA Applications submitted for all quarters.
We can compare the previous map with the median income for each Borough in 2019. From this map, we can see that Manhattan and Staten Island have the highest median incomes, while the Bronx has the lowest. There could be numerous reasons for this difference. For example, students who reside in a specific Borough may attend school within another.
Note: 2019 was purposefully chosen to reflect median income per Borough because 2020 FAFSA Applications would mostly reflect 2019 Tax information.
Here is our previous data re-illustrated into Bar graphs to show everything in more detail.
Note: June 2020/Quarter 3 was chosen to sample because FAFSA Data is broken down by Quarters and June 2020 is Federal FAFSA deadline.
Fun Facts
-
Both the schools with the highest completion rates and the schools with the lowest completion rates had generally low population of 12th graders. Most below a class of 100 students.
​
-
Among the top 10 and lowest 10, the schools with higher completion rates still had a higher population of 12th grade students.
​
-
Among the specialized high schools, Staten Island Technical High School had the best rate of completion for FAFSA with a rate of 99.4%, followed quickly by Stuyvesant High School (92.9%) and The Bronx High School of Science (90.1%).
​
-
Fiorello H. Laguardia High School was the lowest among the specialized high schools with a rate of 68.4%.
-
There is a difference between submitted FAFSA applications and completed FAFSA applications. According to the Federal Student Aid, these submitted applications are "rejected by the Central Processing System if they are missing key pieces of information."
​
​
From these bar charts, we may say that, with the exception of Staten Island, Boroughs with high median income had a higher rate of completion for FAFSA applications.
Independent and Dependent Application Submissions
By the end of Q3 (June 2020), we can see the total number of FAFSA submissions from students who are either dependent or independent on their tax information. This pie chart shows there are mostly dependent students by the end of this quarter, which also happens to be the FAFSA deadline.
From this, we can conclude most FAFSA application submissions in NY, at least for this sample, were from students who are still dependent on another.
Conclusion
From the data and visualizations, we can see the demographics requesting for Federal Student Aid within NYC. From this sample, mostly those who are dependent are completing the FAFSA application. We can also see there is a slight correlation between median income per Borough and the completion rate of FAFSA applications per borough. With the exception of Staten Island, in general, Boroughs with higher median incomes, had higher completion rates. This disprove my hypothesis and the reason may be that the higher income areas are better at having students submit FAFSA. Another reason is that I computed everything using completed applications, it is possible applications from these areas were submitted but not properly.
Citations
http://www.p12.nysed.gov/irs/statistics/enroll-n-staff/home.html
https://studentaid.gov/data-center/student/application-volume/fafsa-completion-high-school
https://studentaid.gov/data-center/student/application-volume/fafsa-school-state
https://data.cccnewyork.org/data/download#0,8/66
https://data.cityofnewyork.us/City-Government/Borough-Boundaries/tqmj-j8zm
Program Code
'''
Name: Diamilatou Barry
Email: diamilatou.barry59@myhunter.cuny.edu
Course: CSci 39542, Hunter College
Title: The Advantage of Financial Aid
Abstract: This project is meant to show the effects of Financial Aid within New York. Using visualizations and research, this project hopes to identify which demographics apply for Financial Aid and correlations within the data.
Resources: using pandas, pandasql, numpy, matplotlib and folium libraries. Using class materials, lectures, textbook and documentation.
schoolsChoropleth.py from lecture to create choropleth maps.
NYS Education Department http://www.p12.nysed.gov/irs/statistics/enroll-n-staff/home.html
FAFSA https://studentaid.gov/data-center/student/application-volume/fafsa-completion-high-school
FAFSA https://studentaid.gov/data-center/student/application-volume/fafsa-school-state
Citizen's Committee for Children of New York https://data.cccnewyork.org/data/download#0,8/66
NYC Open Data https://data.cityofnewyork.us/City-Government/Borough-Boundaries/tqmj-j8zm
URL: https://diamilatoubarry59.wixsite.com/financial-aid
'''
import pandas as pd
from pandasql import sqldf
import numpy as np
import matplotlib.pyplot as plt
import folium
# Reading and cleaning datasets
# Fafsa data NY.csv provides number of submitted and completed fafsa application
# Citizen's Committee for Children of New York's data provides enrollment total for all NYC public schools
nyCompletion = pd.read_csv('NY.csv')
nyEnrollment = pd.read_csv('PublicSchool2021AllStudents.csv')
df = pd.DataFrame(nyCompletion)
df = df.replace('<5', 4)
# Obtain only high schools with a graduating class
q1 = 'select "NAME","GRADE 12" from nyEnrollment where "Grade 12" > 0'
q1_columns = sqldf(q1)
df1 = pd.DataFrame(q1_columns)
# print(df1.nlargest(10, 'GRADE 12'))
# merging FAFSA application submissions with HS enrollment data to clean and use together
dfSchools = df.merge(df1, how='outer', on='NAME')
dfSchools = dfSchools.dropna()
cols = ['Complete Jun 2020','Complete Dec 2020','GRADE 12']
dfSchools[cols] = dfSchools[cols].astype(int)
# calculating completion rates by dividing completed applications with 12th grade enrollment per HS
dfSchools['Rates'] = dfSchools['Complete Jun 2020']/dfSchools['GRADE 12']
# fitering to remove the extreme outliers, some going as high as 500%
dfSchools = dfSchools[(dfSchools["Rates"] < 1.1)]
# print(dfSchools['Rates'])
# print(dfSchools.nlargest(10, 'Rates'))
# dataset is inconsistent with naming cities, lists boroughs but breaks down Queens by neighborhoods
# taking a sample of most NYC high schools
cities = ['BRONX', 'BROOKLYN', 'NEW YORK', 'FLUSHING', 'FAR ROCKAWAY','JAMAICA','BELLEROSE','STATEN ISLAND','LONG ISLAND CITY', 'BAYSIDE']
dfSchools = dfSchools.loc[dfSchools['City'].isin(cities)]
rates = dfSchools['Rates']
schools = dfSchools['NAME']
# Check 10 largest and 10 smallest rates for the schools
# print("Largest Rates: ", dfSchools.nlargest(10, 'Rates'))
# print("Smallest Rates: ", dfSchools.nsmallest(10, 'Rates'))
# Check rated of specialized high schools in nyc
specialized = ['FIORELLO H LAGUARDIA HIGH SCHOOL','BRONX HIGH SCHOOL OF SCIENCE (THE)','BROOKLYN LATIN SCHOOL (THE)','BROOKLYN TECHNICAL HIGH SCHOOL','HIGH SCHOOL FOR MATHEMATICS, SCIENCE AND ENGINEERING AT CITY COLLEGE OF NEW YORK','HIGH SCHOOL OF AMERICAN STUDIES AT LEHMAN COLLEGE','QUEENS HIGH SCHOOL FOR THE SCIENCES AT YORK COLLEGE','STATEN ISLAND TECHNICAL HIGH SCHOOL','STUYVESANT HIGH SCHOOL']
print(dfSchools.loc[dfSchools['NAME'].isin(specialized)])
# x axis data clutters, would be better to group by city
colors = np.arange(252)
plt.scatter(schools, rates, c=colors, edgecolors='face', cmap='viridis')
plt.xlabel('High Schools')
plt.xticks(schools[::2], rotation='vertical')
plt.tick_params(labelbottom=False)
plt.ylabel('Completion Rates')
plt.title('Completion Rates per NYC High School in June 2020')
fig1 = plt.gcf()
plt.show()
# plt.draw()
fig1.savefig('NYC_Fafsa_Completion.png', dpi=100)
# Getting completion percentage of FAFSA Applications per NYC Borough
dfSchools2 = dfSchools.groupby('City').agg("mean").reset_index()
dfSchools2 = dfSchools2.loc[dfSchools2['City'].isin(cities)]
# print(dfSchools2)
# Clean the data so the names are the same and can be grouped
dfSchools2 = dfSchools2.replace('BRONX', 'Bronx')
dfSchools2 = dfSchools2.replace('BROOKLYN', 'Brooklyn')
dfSchools2 = dfSchools2.replace('NEW YORK', 'Manhattan')
dfSchools2 = dfSchools2.replace('FLUSHING', 'Queens')
dfSchools2 = dfSchools2.replace('JAMAICA', 'Queens')
dfSchools2 = dfSchools2.replace('LONG ISLAND CITY', 'Queens')
dfSchools2 = dfSchools2.replace('BAYSIDE', 'Queens')
dfSchools2 = dfSchools2.replace('STATEN ISLAND', 'Staten Island')
dfSchools2 = dfSchools2.groupby('City').agg("mean").reset_index()
borough = dfSchools2['City']
# Convert rate to percentages
rates = dfSchools2['Rates'] * 100
dfSchools2['Percentage'] = dfSchools2['Rates'] * 100
# print(dfSchools2)
# Make bar graph of completion percentage
b_colors = ['#FFC93C', '#781D42', '#3CA59D', '#ED6663', '#9B3675']
plt.figure()
x_pos = [i for i, _ in enumerate(borough)]
plt.ylim(0, 100)
plt.bar(borough, rates, color=b_colors)
plt.xlabel("Borough")
plt.ylabel("Percent of Completed Applications")
plt.title("Percent of Completed FAFSA Applications per Borough in June 2020")
plt.xticks(x_pos, borough)
fig2 = plt.gcf()
plt.show()
fig2.savefig('Borough_Completion.png', dpi=100)
# Using median income data from 2019 because 2020 FAFSA would use 2019 tax information
nyIncomes = pd.read_csv('Median_Incomes.csv')
q2 = 'select * from nyIncomes where ("Location"=="Bronx" AND "Household Type"=="All Households" AND "TimeFrame"=="2019") OR ("Location"=="Brooklyn" AND "Household Type"=="All Households" AND "TimeFrame"=="2019") OR ("Location"=="Manhattan" AND "Household Type"=="All Households" AND "TimeFrame"=="2019") OR ("Location"=="Staten Island" AND "Household Type"=="All Households" AND "TimeFrame"=="2019") OR ("Location"=="Queens" AND "Household Type"=="All Households" AND "TimeFrame"=="2019")'
q2_columns = sqldf(q2)
dfIncomes = pd.DataFrame(q2_columns)
# print(dfIncomes)
# Create bar graph of the median incomes in nyc
dfIncomes['Data'] = dfIncomes['Data'].astype(int)
borough = dfIncomes['Location']
income = dfIncomes['Data']
plt.figure()
x_pos = [i for i, _ in enumerate(borough)]
plt.bar(borough, income, color=b_colors)
plt.ylim(0, 100000)
plt.xlabel("Borough")
plt.ylabel("Median Income")
plt.title("Median Income per Borough in 2019")
plt.xticks(x_pos, borough)
fig3 = plt.gcf()
plt.show()
fig3.savefig('Borough_Income.png', dpi=100)
# Plot incomes on map of nyc boroughs
vis = 'Borough_Boundaries.geojson'
m = folium.Map(location=[40.7128,-74.0060], scale=13, tiles="cartodbpositron")
choropleth = folium.Choropleth(
geo_data=vis,
name="choropleth",
data = dfIncomes,
columns = ['Location', 'Data'],
fill_color = 'Greens',
key_on = 'feature.properties.boro_name'
).add_to(m)
folium.LayerControl().add_to(m)
m.save('nycIncomeMap.html')
# plot rates on map of nyc boroughs
vis2 = 'Borough_Boundaries.geojson'
m = folium.Map(location=[40.7128,-74.0060], scale=13, tiles="cartodbpositron")
choropleth = folium.Choropleth(
geo_data=vis2,
name="choropleth",
data = dfSchools2,
columns = ['City', 'Percentage'],
fill_color = 'Greens',
key_on = 'feature.properties.boro_name'
).add_to(m)
folium.LayerControl().add_to(m)
m.save('nycCompletionMap.html')
# read and clean demographic data for fafsa giving us number of independent and dependent applicants
demographic = pd.read_csv('2020-2021-app-data-by-state-q3.csv')
demographic = pd.DataFrame(demographic)
# read only data related to ny
ny = demographic.loc[demographic['State'] == 'New York']
dep = int(ny['Dependent Students'])
indep = int(ny['Independent Students'])
# Create pie chart of the demographic data
# plot two columns dependent and independent
ny_data = [dep, indep]
my_labels = ['Dependent', 'Independent']
my_colors = ['#4BD1A0','silver']
my_explode = (0, 0.1)
plt.figure()
plt.pie(ny_data,labels = my_labels, autopct='%1.1f%%', startangle=15, shadow = True, colors=my_colors, explode=my_explode)
plt.title('Independent vs Dependent End of June 2020')
plt.axis('equal')
fig4 = plt.gcf()
plt.show()
fig4.savefig('depVSindep.png', dpi=100)