Exploring Urban Data with Machine Learning: Surface Temperature and Green Space Coverage

January–May 2022 | New York

Project Team Members: Hilary Ho, Sean Chew, HaoChe Hung, Theresa Yang

As part of the class “Exploring Urban Data and Machine Learning” in my urban planning program at Columbia GSAPP, my peers and I explored the relationship between green space and the Urban Heat Island Effect in NYC. The focus of the project and the course was to compare machine learning techniques in Python and see how methodological differences affect the answers to a specific set of research questions.

The Urban Heat Island Effect (UHIE) occurs when impervious surfaces replace areas of vegetation, raising surface temperatures in cities relative to surrounding areas. Inequitable access to green space across NYC may leave some neighborhoods more at risk of UHIE than others.

Questions

We sought to answer the following questions in our research using machine learning methods:

Main Research Question

To what degree can the presence of green space in neighborhoods predict surface temperatures in NYC?

Sub Research Questions

Given the differences in the amount of green space across the boroughs, which neighborhoods are particularly vulnerable to the Urban Heat Island Effect?

What are the discrepancies in green coverage between different land uses (e.g. commercial, residential)?

Process

We relied on Landsat data both to derive land surface temperature (LST) across NYC, which was our dependent variable, and to measure the amount of green space within a given area using the Normalized Difference Vegetation Index (NDVI).
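For readers unfamiliar with NDVI, the index is computed per pixel from near-infrared and red reflectance as (NIR - Red) / (NIR + Red), with higher values indicating denser vegetation. The sketch below is only an illustration of that formula, not our actual processing code: the function names and the reflectance values are hypothetical, and averaging pixel values over an area is an assumed way of summarizing NDVI for a tract or lot.

```python
def ndvi(nir, red):
    """Return NDVI for one pixel from near-infrared and red reflectance."""
    total = nir + red
    if total == 0:
        return None  # undefined, e.g. a masked or no-data pixel
    return (nir - red) / total


def mean_ndvi(nir_pixels, red_pixels):
    """Average NDVI over the valid pixels of one area (assumed summary)."""
    values = [ndvi(n, r) for n, r in zip(nir_pixels, red_pixels)]
    valid = [v for v in values if v is not None]
    return sum(valid) / len(valid) if valid else None


# Hypothetical reflectance values for three pixels inside one area:
# a vegetated pixel, a bare or paved pixel, and a no-data pixel.
nir_pixels = [0.5, 0.3, 0.0]
red_pixels = [0.1, 0.3, 0.0]

area_ndvi = mean_ndvi(nir_pixels, red_pixels)
print(area_ndvi)
```

The no-data pixel is skipped rather than counted as zero, so it does not drag the average toward "no vegetation".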

After cleaning the raw Landsat temperature data, we ran two parallel machine learning workflows on different geographic units: census tracts (using American Community Survey data from the US Census Bureau) and MapPLUTO tax lots (New York City tax lot data collected by the NYC Department of City Planning).

For our census tract analysis, we used Lasso and Random Forest models to predict land surface temperature from demographic features (e.g. race and income). We then used K Means clustering and a Decision Tree model to conduct a spatial analysis, checking whether the predicted urban heat island effect showed spatial clustering patterns informed by demographic factors.
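To make the clustering step more concrete, here is a minimal sketch of the K Means idea (Lloyd's algorithm) in plain Python. It is not the code we ran: the tract values are made up, the two features (surface temperature in degrees Celsius and a demographic share) are assumed for illustration, and the farthest-first initialization is a simple deterministic choice made for this example. In a real analysis the features would normally be standardized first, since otherwise the temperature column dominates the distance.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(points, k):
    """Pick k starting centroids: the first point, then repeatedly the
    point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iterations=50):
    """Alternate assignment and centroid updates until centroids stop moving."""
    centroids = farthest_first(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical tracts: (surface temperature in C, demographic share).
tracts = [
    (30.0, 0.20), (30.4, 0.25),
    (34.0, 0.50), (34.6, 0.55),
    (38.0, 0.80), (38.5, 0.85),
]
labels, centroids = kmeans(tracts, k=3)
print(labels)
```

The number of clusters is an input to K Means rather than something it discovers, which is one reason a clustering method and a tree-based method can group the same tracts differently.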

For our tax lot analysis, we ran an OLS regression of land surface temperature on NDVI (our measure of green space) to see how strongly the two were related at the tax lot level.
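As a rough illustration of what this regression measures, the sketch below fits a one-variable OLS line and reports R squared, the share of variation in temperature explained by NDVI. It is a hand-rolled example under stated assumptions: the function name is hypothetical, the five data points are invented, and the strength of the fit here says nothing about our actual results.

```python
def ols_fit(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R squared compares the residual error with the total variation in y.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared


# Invented values for five tax lots: NDVI and surface temperature in C.
lot_ndvi = [0.1, 0.2, 0.3, 0.4, 0.5]
lot_lst = [35.0, 34.0, 34.5, 32.5, 32.0]

slope, intercept, r_squared = ols_fit(lot_ndvi, lot_lst)
print(slope, intercept, r_squared)
```

A negative slope means greener lots tend to be cooler, while a low R squared means NDVI alone leaves most of the temperature variation unexplained.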

Results

Census Tract Analysis:

The Random Forest model returned the highest training and test scores.
The most important demographic feature in our model was non-family households, although the feature importance of this factor was still below 10%. This means that, in our model, the presence of non-family households was the factor most strongly associated with the spatial clustering patterns we saw between land surface temperature and census tract demographics.

We conducted a spatial analysis using K Means clustering and a Decision Tree. K Means produced 3 clusters while the Decision Tree produced 4 categories, showing that the two methods produced slightly different groupings even when applied to the same variables.

Tax Lot Analysis:

OLS regression analysis

Neighborhood green space is a relatively weak predictor of LST. This is likely due to limitations of the Landsat data used in our analysis. For example, the satellite imagery may have been compromised by cloud cover, lowering the quality of the source data in our model.
Whether a tax lot falls under the “Public land use” category was the best predictor of land surface temperature on a city-wide scale in our model.