=================================
'passive capture' is my first pass at a GeoDjango application. Regrettably, due to time constraints and package conflicts between Django 2.0, PostGIS, and Heroku, I wasn't able to deliver a product that I could reliably host for end use. The application is functional locally, and the code can be viewed/downloaded here:
Passive Capture Python GitHub repository
Although far from perfect, I hope to deliver a crude tool which enables time-series analysis and visualization for fish passage tracking efforts on the Columbia River. It's been a lot of fun working with a new platform and experimenting with statistical/GIS packages in Python I may not have had the chance to use otherwise. As time allows, I hope to complete the application and provide hosting and improvements.
To provide a tool which enables users to create custom time-series forecasts and visualizations for anadromous fish returns along the Columbia River.
Compare the efficacy of Facebook's open source time-series forecasting package (Prophet) against traditional ARIMA methods.
Below are my notes taken during the development of this application.
I've taken a crack at fish passage modeling before, and felt that using the US Army Corps of Engineers (USACE) fish passage data for the Columbia River would be the most familiar and applicable for this project. The Fish Passage Center (FPC) provides a great interface to collect this data, and I've developed APIs in both Ruby and R to query the current day's data.
I wanted to work with a static set of data initially, as it requires less overhead than keeping an API up and running to ETL (Extract, Transform, Load) the FPC data on a schedule.
First Steps:
Example Query to normalize the FPC Data Into BigQuery
SELECT
PARSE_DATE("%m/%d/%Y", date) AS count_date,
dam,
ChinookAdult AS chinook_adult,
ChinookJack AS chinook_jack,
CohoAdult AS coho_adult,
CohoJack AS coho_jack,
Steelhead AS steelhead,
WildSteelhead AS wild_steelhead,
Sockeye AS sockeye,
Pink AS pink,
Chum AS chum,
Lamprey AS lamprey,
Shad AS shad
FROM `passive-capture.passage_data.daily_counts`
ORDER BY count_date;
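For reference, here is a minimal sketch of how the normalized counts could be pulled down into a local CSV with the google-cloud-bigquery client -- this assumes credentials for the passive-capture project are already configured locally, and the output path is just illustrative:
from google.cloud import bigquery

client = bigquery.Client(project='passive-capture')
sql = """
SELECT PARSE_DATE("%m/%d/%Y", date) AS count_date, dam,
       ChinookAdult AS chinook_adult, Steelhead AS steelhead
FROM `passive-capture.passage_data.daily_counts`
ORDER BY count_date
"""
# run the query and materialize the result as a pandas dataframe
daily_counts = client.query(sql).to_dataframe()
daily_counts.to_csv('./csv/passage_data/daily_counts.csv', index=False)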
Additional Data Collected:
For this project, I'm excited to experiment with Facebook's Prophet. The Prophet procedure is described as an additive regression model with four main components: a piecewise linear or logistic growth trend, a yearly seasonal component modeled with Fourier series, a weekly seasonal component modeled with dummy variables, and a user-provided list of holidays.
Perhaps the most interesting aspect of Facebook's open source Prophet time-series forecasting package is the ability to incorporate additional regressors into the model -- something that has traditionally been difficult in time-series analysis software. Other attractive features Prophet offers are options to toggle seasonality detection, define custom seasonalities, or add 'holidays'. Holidays are periods of time where the model may expect additional variance in the trend, which seemed appropriate for runs of fish, which occur seasonally.
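Here's a hedged sketch of those knobs -- toggling built-in seasonalities, adding a custom seasonality, and supplying 'holidays'. The run window below is purely illustrative, not an actual run timing:
import pandas as pd
from fbprophet import Prophet

spring_run = pd.DataFrame({
    'holiday': 'spring_chinook_run',          # hypothetical event name
    'ds': pd.to_datetime(['2016-04-15', '2017-04-15', '2018-04-15']),
    'lower_window': -30,                       # days before ds the effect may start
    'upper_window': 45,                        # days after ds the effect may persist
})

m = Prophet(yearly_seasonality=True,           # toggle built-in seasonalities
            weekly_seasonality=False,
            daily_seasonality=False,
            holidays=spring_run)
m.add_seasonality(name='monthly', period=30.5, fourier_order=5)   # custom seasonality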
For additional regressors, I hope to examine the impact of current and historical precipitation and dam flows on the model. Precipitation as both rain and snow is a driving force in fry and parr survival rates, as well as in the timing of when salmon species will either 'hold' or push on with their migration. Flows through the dams also play a critical part in determining parr/smolt predation and survival rates.
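Prophet exposes this through its add_regressor() method. A minimal sketch -- the precip and flow column names here are hypothetical, and the same columns would have to exist in the future dataframe at prediction time:
from fbprophet import Prophet

m = Prophet(daily_seasonality=False)
m.add_regressor('precip')   # hypothetical daily precipitation column
m.add_regressor('flow')     # hypothetical daily dam flow column
# the training dataframe would need ds, y, precip and flow columns before calling m.fit(df)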
Additional Regressors:
NOAA data was extracted from public datasets -- metadata came from https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt
Example query for snow data grouped by year and month:
SELECT
CONCAT( CAST(EXTRACT(YEAR FROM date) AS STRING),
CAST(EXTRACT(MONTH FROM date) AS STRING) ) AS year_month,
stations.name,
stations.latitude,
stations.longitude,
weather.id,
weather.element,
SUM(weather.value) AS snow_total,
AVG(DISTINCT weather.value) AS snow_average,
MIN(weather.value) AS snow_min,
MAX(weather.value) AS snow_max
FROM `bigquery-public-data.ghcn_d.ghcnd_stations` AS stations
JOIN `bigquery-public-data.ghcn_d.ghcnd_201*` AS weather
  ON weather.id = stations.id
WHERE stations.latitude > 43
  AND stations.latitude < 49
  AND stations.longitude > -123
  AND stations.longitude < -120
  AND element = 'SNWD'
GROUP BY year_month, stations.name, stations.latitude, stations.longitude, weather.id, weather.element
ORDER BY year_month
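Once both queries are exported, a rough sketch of joining the monthly snow aggregates onto the daily passage counts (the snow CSV path is hypothetical) might look like:
import pandas as pd

passage = pd.read_csv('./csv/passage_data/bon_passage_data.csv', parse_dates=['count_date'])
snow = pd.read_csv('./csv/weather_data/snow_by_year_month.csv')   # hypothetical export of the query above

# rebuild the same YEAR || MONTH key the BigQuery example produces
passage['year_month'] = (passage['count_date'].dt.year.astype(str)
                         + passage['count_date'].dt.month.astype(str))
merged = passage.merge(snow[['year_month', 'snow_total', 'snow_average']],
                       on='year_month', how='left')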
At the moment, there are two models I would like to pursue with this project. The first is a composite model between ARIMA and Prophet, utilizing seasonality per distinct anadromous fish run along with weather and flow features as additional regressors. The second would be ARIMA/Prophet time-series analysis on each individual feature -- run counts, flow, dissolved gas, precipitation, etc. -- and then composing a multivariate regression model from the predicted future values that may or may not contribute to run counts.
I'm taking the log of the run counts because the variance in any given sample over the time series differs significantly. In a log-log regression model, an estimated parameter, say αi, can be interpreted as the elasticity of Y(t) with respect to Xi(t). In error-correction models there is an empirically stronger assumption that proportions are more stable (stationary) than absolute differences, and it is easier to aggregate log-returns over time.
If you still want a statistical criterion for when to apply the log transformation, a simple solution is any test for heteroscedasticity. In the case of increasing variance I would recommend the Goldfeld-Quandt test or something similar. In R it is located in library(lmtest) and is called with gqtest(y ~ 1) -- simply regress on the intercept term if you don't have a regression model, where y is your dependent variable.
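Since the rest of this project is in Python, here's a hedged equivalent using statsmodels' het_goldfeldquandt, assuming the rows are already ordered by date so the split compares early-period versus late-period variance:
import numpy as np
import pandas as pd
from statsmodels.stats.diagnostic import het_goldfeldquandt

raw_df = pd.read_csv('./csv/passage_data/bon_passage_data.csv')
y = raw_df['chinook_adult']
X = np.ones((len(y), 1))                       # intercept-only design, mirroring gqtest(y ~ 1)
f_stat, p_value, _ = het_goldfeldquandt(y, X, alternative='increasing')
print(f_stat, p_value)                         # a small p-value suggests increasing variance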
import os
import pandas as pd
import numpy as np
import statsmodels
import statsmodels.api as sm
import statsmodels.formula.api as smf
from datetime import datetime
import matplotlib.pyplot as plt
from fbprophet import Prophet
import warnings
# Ignore upcoming deprecation warnings in the notebook
warnings.filterwarnings('ignore')
The forecast_generator.py script is where we download, clean, predict, and store the data for time-series analysis.
# method to clean and format the dataframe for analysis. arguments are the CSV as formatted from FPC, the species to target,
# and whether to display a 'count' of fish, or the log(count)
csv = './csv/passage_data/bon_passage_data.csv'
spp = 'chinook_adult'
def create_dataframe(csv, spp, display_count=False):
    df = pd.read_csv(csv)
    # isolate the dam
    dam_name = df['dam'][0]
    # select columns to remove to isolate a spp
    cols_to_remove = [col for col in df.columns if f"{spp}" not in col and 'count_date' not in col]
    df = df.drop(cols_to_remove, axis=1)
    # create date range to accommodate missing dates
    df['count_date'] = pd.to_datetime(df['count_date'])
    idx = pd.date_range(df['count_date'].iloc[0], df['count_date'].iloc[-1])
    df.set_index('count_date', drop=True, inplace=True)
    df = df.reindex(idx, fill_value=0).reset_index()
    # for prophet, format to use ds and y cols
    df['ds'] = df['index']
    # flag return values to be either counts, or the log of the counts
    if display_count:
        df['y'] = df[f"{spp}"]
    else:
        df['y'] = np.log(df[f"{spp}"])
    # replace all -inf
    df['y'] = df['y'].replace([np.log(0)], 0)
    df = df.drop(['index', f"{spp}"], axis=1)
    return {'dataframe': df, 'dam': dam_name, 'spp': spp}
# run the method:
result_object = create_dataframe(csv,spp)
df = result_object['dataframe']
print('Here is an example output of the normalized dataframe. ds represents the date, and y is a log(count) for a given spp')
df.tail()
So we have a dataframe which represents our range of data from the first date in the CSV (Jan 1st 2010) to the last (April 9th 2018).
Some dates were missing, so we added them into the dataframe with a fish count of 0. That, along with days where no fish passed, accounts for the error output stating divide by zero encountered in log, since log(0) is undefined (not a real number).
I elected to use the natural log (log(count)) instead of the raw count to perform the analysis because there is uneven variance across our error term -- heteroscedasticity.
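As an aside, a common alternative to clamping log(0) back to zero is np.log1p, i.e. log(1 + count), which maps zero-count days to 0 without ever producing -inf. A tiny sketch:
import numpy as np
import pandas as pd

counts = pd.Series([0, 3, 150, 67521])
print(np.log1p(counts))   # 0.0, 1.386..., 5.017..., 11.120... -- no -inf to clean up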
Let's explore the variance in our dataset by looking at the following two line graphs representing chinook adult passage over our time series.
raw_df = pd.read_csv(csv)
print('## ADULT CHINOOK COUNT AT BONNEVILLE AS INTEGER ##')
chinook_count = raw_df['chinook_adult']
chinook_count[chinook_count > 0].plot(kind='line')
print('## ADULT CHINOOK COUNT AT BONNEVILLE AS LOG(INTEGER) ##')
df['y'][df['y'] > 0 ].plot(kind='line')
Just by eyeballing it, it seems as if the log(count) forecast will provide relatively more accurate results and better predict the long-range, highly variant trends we're looking for. As seen below, the minimum value in our set is -1 and the maximum is 67,521. The variance across days, months, dams, and years can also be substantial.
Let's continue by performing a Breusch-Pagan test for heteroscedasticity versus homoscedasticity. The null hypothesis of the Breusch-Pagan test is homoscedasticity (i.e., the variance does not depend on the auxiliary regressors). If the p-value is less than .05, the null hypothesis is rejected.
We'll have to convert the count_date from a string to a numerical value to serve as the independent variable in this experiment. Since we're concerned about month/day time-series trends, I wanted to try column-stacking the month and day values.
minimum = raw_df['chinook_adult'].min()
maximum = raw_df['chinook_adult'].max()
print('####')
print('The minimum in the dataframe is', minimum, ' and the maximum is', maximum)
print('####')
# ORIGINAL METHOD TO TRY AND CONVERT DATE STRING TO NUMERICAL INDEPENDENT VARIABLE
# date = df['count_date'].apply(lambda x: (datetime.strptime(x, '%Y-%m-%d').month + datetime.strptime(x, '%Y-%m-%d').day* .01) )
# UPDATED METHOD TO COLUMN STACK MONTH/DAY AS INDEPENDENT VARIABLE
raw_df['month'] = raw_df['count_date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d').month)
raw_df['day'] = raw_df['count_date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d').day)
date = np.column_stack((raw_df['month'],raw_df['day']))
spp_count = raw_df['chinook_adult']
# use statsmodels.api's OLS here, since we're passing arrays rather than a formula
results = sm.OLS(spp_count, date).fit()
print(results.summary())
from statsmodels.stats.diagnostic import het_breuschpagan
names = ['Lagrange multiplier statistic', 'p-value', 'f-value', 'f p-value']
bp = het_breuschpagan(results.resid, results.model.exog)
# pair each test statistic with its value
pd.DataFrame({'statistic': names, 'value': bp})
Since our p-value is less than 0.05, this indicates that heteroscedasticity is present, and we reject the null hypothesis of homoscedasticity.
# Reduce heteroscedasticity -- need to normalize log data to perform test
# raw_df['log_chinook_adult'] = np.log(raw_df['chinook_adult'])
# # replace all -inf
# raw_df['log_chinook_adult'] = raw_df['log_chinook_adult'].replace([np.log(0)], 0)
# log_spp_count = raw_df['log_chinook_adult']
# print(raw_df.describe())
# results = smf.OLS(log_spp_count,date).fit()
# print(results.summary())
# name = ['Lagrange multiplier statistic', 'p-value',
# 'f-value', 'f p-value']
# bp = statsmodels.stats.diagnostic.het_breuschpagan(results.resid, results.model.exog)
# bp
# pd.DataFrame(name,bp)
Now that we have cleaned, formatted, and (somewhat) normalized our dataset, let's fit it to Prophet and start getting some time-series analysis.
import matplotlib.pyplot as plt
from fbprophet import Prophet
def predict_passage(df, dam, spp, num_of_days):
    m = Prophet(interval_width=0.95, daily_seasonality=False)
    m.fit(df)
    future = m.make_future_dataframe(periods=num_of_days)
    forecast = m.predict(future)
    m.plot(forecast, uncertainty=True, xlabel='year', ylabel='log(passage count)')
    forecast['dam'] = dam
    forecast['spp'] = spp
    forecast.to_csv(f"./csv/forecasts/{dam}_{spp}_forecast.csv")
    plt.show()
print('## ADULT CHINOOK AT BONNEVILLE ##')
predict_passage(result_object['dataframe'],result_object['dam'],result_object['spp'],365)
Impressively, Prophet picks up on long-range seasonality trends out of the box; without any additional regressors it's able to fit a reasonable trend line to this chinook run.
Let's inspect what's going on a little more inside the predict_passage method:
df = result_object['dataframe']
num_of_days = 30
print("Prophet() instansiates an object")
m = Prophet(interval_width=0.95, daily_seasonality=False)
print("We call the object to fit our dataframe")
m.fit(df);
print("We create a new dataframe with the desired number of days in the future to forecast")
future = m.make_future_dataframe(periods=num_of_days)
print("Prophet outputs a dataframe of it's forecast")
forecast = m.predict(future)
forecast.tail()
Prophet returns a large DataFrame with many interesting columns, but the columns most relevant to forecasting are ds (the date), yhat (the point forecast), and the yhat_lower/yhat_upper uncertainty interval bounds.
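For example, slicing the forecast down to just those columns:
# keep only the date, point forecast, and uncertainty interval bounds
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()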
Prophet also enables us to look more closely at the components used in forecasting.
m.plot_components(forecast)
from sklearn.metrics import mean_absolute_error
df = result_object['dataframe']
df2 = m.predict(df)
df2['y'] = df['y']
# mean_absolute_error(np.exp(df2['yhat_lower']), np.exp(df2['yhat']))
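As a rough in-sample check -- just a sketch, since the zero-count days were clamped to 0 in log space, so the back-transformed counts are only approximate:
print('in-sample MAE (log scale):', mean_absolute_error(df2['y'], df2['yhat']))
print('in-sample MAE (approx. counts):', mean_absolute_error(np.exp(df2['y']), np.exp(df2['yhat'])))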
csv = './csv/passage_data/jda_passage_data.csv'
spp = 'chinook_adult'
result_object = create_dataframe(csv,spp)
print('## ADULT CHINOOK AT JOHN DAY ##')
predict_passage(result_object['dataframe'],result_object['dam'],result_object['spp'],365)
csv = './csv/passage_data/bon_passage_data.csv'
spp = 'sockeye'
result_object = create_dataframe(csv,spp)
print('## SOCKEYE AT BONNEVILLE ##')
predict_passage(result_object['dataframe'],result_object['dam'],result_object['spp'],365)
The Prophet time-series forecasts have exceeded my expectations for time-series analysis and performance, and I believe they provide an excellent resource for mapping and visualizing fish passage data.
The next step is to display the information rendered from the forecast visually.
I've been wanting to experiment with folium for a while now; it's an open source, Leaflet-powered map package for Python. I started by getting the lat/long of the dams I wished to visualize, and then experimented with generating a map with markers and two tile layers.
import folium
from folium.plugins import HeatMap

dam_location_dictionary = {'BON': [45.644167, -121.9449334],
                           'TDA': [45.6125567, -121.137872],
                           'JDA': [45.7148507, -120.6958588],
                           'MCN': [45.9322088, -119.3021506]}

m = folium.Map(location=(45.7, -121.25), zoom_start=7)
dam_locations = folium.FeatureGroup("Dams")
for key, value in dam_location_dictionary.items():
    dam_locations.add_child(folium.CircleMarker(location=value,
                                                radius=25,
                                                popup=key,
                                                color='#3186cc',
                                                fill=True,
                                                fill_color='#3186cc'))
m.add_child(dam_locations)
folium.TileLayer('Stamen Terrain').add_to(m)
folium.LayerControl().add_to(m)
m
It's pretty easy to use folium to get a map up and running. Let's look at leveraging the heat map plugin to generate relative concentrations of salmon counts at each passage point.
Since the data has not been aggregated and formatted in a usable way yet, I'm going to generate a dictionary (hash map) of dam lat/lng points and insert a random float value for the weight.
m = folium.Map(location=(45.7, -121.25), zoom_start=7)
dam_location_dictionary = {'BON': [45.644167, -121.9449334],
                           'TDA': [45.6125567, -121.137872],
                           'JDA': [45.7148507, -120.6958588],
                           'MCN': [45.9322088, -119.3021506]}
dam_df = pd.DataFrame(list(dam_location_dictionary.items()), columns=['name', 'lat/lng'])
print('## Dam Dataframe ##')
print(dam_df)

import random
damArr = []
for v in dam_df['lat/lng']:
    damArr.append(v)
for sublist in damArr:
    # coerce to float and append a random weight for the heat map
    sublist[:] = [float(x) for x in sublist]
    sublist.append(random.uniform(1.5, 101.9))
print('## list of lat long coords ##')
print(damArr)

# plot heatmap
m.add_child(HeatMap(damArr, radius=25))
folium.TileLayer('Stamen Terrain').add_to(m)
folium.LayerControl().add_to(m)
m
As we can see, the map displays points where we have dam sites. To make real use of the heat map functionality, we'll have to add points as a function of how many fish of a given spp are passing through at any given moment. This is going to take some data manipulation in pandas.
m = folium.Map(location=(45.64, -121.94), zoom_start=10)
dam_location_dictionary = {'BON': [45.644167, -121.9449334],
                           'BON1': [45.654167, -121.9549334],
                           'BON2': [45.634167, -121.9349334],
                           'BON3': [45.664167, -121.9649334],
                           'BON4': [45.614167, -121.9149330],
                           'TDA': [45.6125567, -121.137872],
                           'JDA': [45.7148507, -120.6958588],
                           'MCN': [45.9322088, -119.3021507],
                           'MCN1': [45.9322088, -119.3221508],
                           'MCN2': [45.9322088, -119.3121506]}
dam_df = pd.DataFrame(list(dam_location_dictionary.items()), columns=['name', 'lat/lng'])
print('## Dam Dataframe ##')
print(dam_df)

import random
damArr = []
for v in dam_df['lat/lng']:
    damArr.append(v)
for sublist in damArr:
    # coerce to float and append a random weight for the heat map
    sublist[:] = [float(x) for x in sublist]
    sublist.append(random.uniform(1.5, 101.9))
print('## list of lat long coords ##')
print(damArr)

# plot heatmap
m.add_child(HeatMap(damArr, radius=25))
folium.TileLayer('Stamen Terrain').add_to(m)
folium.LayerControl().add_to(m)
m
The tack I'm going to take is adding additional points per count magnitude: for every count of fish passage, I will add a point. Above, the points were added in a line; I would prefer to place the points in a circle around the central dam point.
import json
import geog
import shapely.geometry
from iteration_utilities import deepflatten

mock_dam_dataframe = pd.DataFrame({'name': 'BON',
                                   'lat/lng': [[45.644167, -121.9449334]],
                                   'spp_count': 360})

def generate_points_around_dam(dam_dataframe):
    name = dam_dataframe['name'][0]
    coords = dam_dataframe['lat/lng'][0]
    spp_count = dam_dataframe['spp_count'][0]
    # shapely takes lon/lat, so swap coords
    p = shapely.geometry.Point(coords[::-1])
    d = 10 * 40  # meters
    angles = np.linspace(0, 360, spp_count)
    polygon = geog.propagate(p, angles, d)
    points = shapely.geometry.mapping(shapely.geometry.Polygon(polygon))['coordinates']
    points = list(deepflatten(points, depth=1))
    for i, sublist in enumerate(points):
        points[i] = list(sublist)
        # unswap coordinates
        points[i] = points[i][::-1]
    return points
bon_points = generate_points_around_dam(mock_dam_dataframe)
for sublist in bon_points:
    sublist[:] = [float(x) for x in sublist]
    # add a random weight to the heat map -- eventually this should be a number relative to orders of magnitude above other dams
    sublist.append(random.uniform(1.5, 12.0))
bon_points = bon_points + damArr
m = folium.Map(location=(45.7,-121.25),
zoom_start=6
)
# plot heatmap
m.add_child(HeatMap(bon_points, radius=25))
folium.TileLayer('Stamen Terrain').add_to(m)
folium.LayerControl().add_to(m)
m
Alright, so now on our heat map, if there are 360 fish moving through Bonneville, the map will reflect that. Let's get some real data in place.
First, I'm going to create a 'chinook adult' forecast for all 4 dams.
spp = 'chinook_adult'
csvs = [os.path.abspath(f"./csv/passage_data/{x}") for x in os.listdir('csv/passage_data')]
for csv in csvs:
    result_object = create_dataframe(csv, spp)
    predict_passage(result_object['dataframe'], result_object['dam'], result_object['spp'], 365)
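With those forecasts written out, here is a sketch of how a single forecast day could become weighted heat-map points per dam, reusing the functions above. This assumes the dam codes stored in the passage CSVs match the keys below, and it clamps the point count so the generated polygon stays manageable:
# the four real dam sites, reusing the coordinates from earlier
dam_locations = {'BON': [45.644167, -121.9449334],
                 'TDA': [45.6125567, -121.137872],
                 'JDA': [45.7148507, -120.6958588],
                 'MCN': [45.9322088, -119.3021506]}

target_date = pd.Timestamp('2018-06-01')   # illustrative date inside the forecast horizon
heat_points = []
for dam, latlng in dam_locations.items():
    forecast = pd.read_csv(f"./csv/forecasts/{dam}_{spp}_forecast.csv", parse_dates=['ds'])
    row = forecast[forecast['ds'] == target_date]
    if row.empty:
        continue
    # back-transform the log forecast to an approximate count, clamped for plotting
    predicted_count = int(np.clip(np.exp(row['yhat'].iloc[0]), 3, 1000))
    dam_df = pd.DataFrame({'name': dam, 'lat/lng': [latlng], 'spp_count': predicted_count})
    heat_points += generate_points_around_dam(dam_df)

m = folium.Map(location=(45.7, -121.25), zoom_start=7)
m.add_child(HeatMap(heat_points, radius=25))
m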