Preprocess
The first step in building the envisioned web app is to load a dataset and preprocess it!
Get Data
I like to use https://mesonet.agron.iastate.edu/request/download.phtml?network=WA_ASOS for station data because of its easy-to-use interface and the straightforward way to download data:
1. Select a network
2. Select stations
3. Select variables
4. Select a timezone
5. Select options
6. Click “Get Data”
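If you'd rather script the request than click through the form, the same query can be assembled programmatically. A minimal sketch, which rebuilds the exact URL used in the Load Data section below:

from urllib.parse import urlencode

base = 'https://mesonet.agron.iastate.edu/cgi-bin/request/asos.py'
params = {
    'station': 'SEA', 'data': 'p01m',
    'year1': 1928, 'month1': 1, 'day1': 1,
    'year2': 2021, 'month2': 3, 'day2': 22,
    'tz': 'America/Los_Angeles', 'format': 'onlycomma',
    'latlon': 'yes', 'elev': 'yes', 'missing': 'empty',
    'trace': 0.0001, 'direct': 'n',
    'report_type': [1, 2],  # doseq=True expands this list into repeated keys
}
url = f'{base}?{urlencode(params, doseq=True)}'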
Import Packages
[1]:
import numpy as np
import pandas as pd
import xarray as xr
import hvplot.pandas  # noqa: registers the .hvplot accessor on pandas objects
import holoviews as hv

hv.renderer('bokeh').theme = 'caliber'  # apply the 'caliber' theme to Bokeh plots
Load Data
Loading the queried dataset is super straightforward: simply pass the URL to pandas.
I selected Seattle because it’s notorious (maybe?) for its rainy days.
[2]:
url = (
'https://mesonet.agron.iastate.edu/cgi-bin/'
'request/asos.py?station=SEA&data=p01m&'
'year1=1928&month1=1&day1=1&'
'year2=2021&month2=3&day2=22&'
'tz=America%2FLos_Angeles&'
'format=onlycomma&latlon=yes&'
'elev=yes&missing=empty&'
'trace=0.0001&direct=n&'
'report_type=1&report_type=2'
)
df = pd.read_csv(url, parse_dates=True, index_col='valid')
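A quick sanity check that parse_dates and index_col did their job; a small sketch:

print(df.index.dtype)  # expect datetime64[ns]
print(df.shape)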
Explore Data
Now let’s take a quick look at some stats and the ends of the dataframe.
[3]:
df.describe()
df.head(10)
df.tail(10)
[3]:
 | lon | lat | elevation | p01m |
---|---|---|---|---|
count | 1.102588e+06 | 1.102588e+06 | 1102588.0 | 174587.000000 |
mean | -1.223144e+02 | 4.744470e+01 | 137.0 | 0.464076 |
std | 1.550348e-09 | 7.358242e-10 | 0.0 | 0.872992 |
min | -1.223144e+02 | 4.744470e+01 | 137.0 | 0.000000 |
25% | -1.223144e+02 | 4.744470e+01 | 137.0 | 0.000000 |
50% | -1.223144e+02 | 4.744470e+01 | 137.0 | 0.250000 |
75% | -1.223144e+02 | 4.744470e+01 | 137.0 | 0.510000 |
max | -1.223144e+02 | 4.744470e+01 | 137.0 | 42.420000 |
[3]:
valid | station | lon | lat | elevation | p01m |
---|---|---|---|---|---|
1948-01-01 00:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
1948-01-01 01:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
1948-01-01 02:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
1948-01-01 03:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
1948-01-01 04:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
1948-01-01 05:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
1948-01-01 06:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
1948-01-01 07:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
1948-01-01 08:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
1948-01-01 09:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
[3]:
valid | station | lon | lat | elevation | p01m |
---|---|---|---|---|---|
2021-03-21 23:15:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
2021-03-21 23:20:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
2021-03-21 23:25:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
2021-03-21 23:30:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
2021-03-21 23:35:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
2021-03-21 23:40:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
2021-03-21 23:45:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
2021-03-21 23:50:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
2021-03-21 23:53:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.0 |
2021-03-21 23:55:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
Clean Data
The describe() output above shows p01m has only 174,587 non-null values out of 1,102,588 rows, so there are a lot of NaNs; let's drop them!
[4]:
df = df.dropna()
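If you want to quantify the missingness before dropping, a one-line sketch:

print(df['p01m'].isna().mean())  # fraction of rows with no precipitation value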
Now let’s check out the head and tail again!
At first the time step is about 2-3 hours; it gradually becomes hourly, drops to roughly 5 minutes near the end, and returns to hourly at the very last row.
[5]:
df.head(10)
df.tail(10)
[5]:
valid | station | lon | lat | elevation | p01m |
---|---|---|---|---|---|
1948-12-13 00:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.25 |
1965-01-01 14:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.25 |
1965-01-01 15:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.25 |
1965-01-01 18:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.25 |
1965-01-01 23:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.25 |
1965-01-02 00:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.25 |
1965-01-02 02:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.51 |
1965-01-02 03:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.51 |
1965-01-02 04:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 1.02 |
1965-01-02 05:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.51 |
[5]:
valid | station | lon | lat | elevation | p01m |
---|---|---|---|---|---|
2021-03-21 22:15:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.25 |
2021-03-21 22:20:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.25 |
2021-03-21 22:25:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.25 |
2021-03-21 22:30:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.25 |
2021-03-21 22:35:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.25 |
2021-03-21 22:40:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.25 |
2021-03-21 22:45:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.25 |
2021-03-21 22:50:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.25 |
2021-03-21 22:53:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.25 |
2021-03-21 23:53:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.00 |
Because of this irregularity, I plan to resample the dataset to hourly. But first: does the repeated 0.25 mean 0.25 mm of precipitation every 5 minutes, or an accumulated 0.25 mm over the past hour?
Thankfully, there’s documentation at https://mesonet.agron.iastate.edu/request/download.phtml?network=WA_ASOS:
*p01i: One hour precipitation for the period from the observation time to the time of the previous hourly precipitation reset. This varies slightly by site. Values are in inches. This value may or may not contain frozen precipitation melted by some device on the sensor or estimated by some other means. Unfortunately, we do not know of an authoritative database denoting which station has which sensor.*

(We requested p01m, which is the same field in millimeters.)
Based on that information, the value accumulates between hourly resets, so taking the maximum within each hour recovers the hourly total.
[6]:
df_rs = df.resample('1H').max()
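To convince yourself that the maximum recovers the hourly total, here is a toy example with synthetic 5-minute reports (not real ASOS data) that accumulate within a single hour:

times = pd.date_range('2021-01-01 00:00', periods=12, freq='5min')
accum = pd.Series([0.00, 0.00, 0.25, 0.25, 0.25, 0.51,
                   0.51, 0.76, 0.76, 1.02, 1.02, 1.02], index=times)
print(accum.resample('1H').max())  # a single row: 1.02, the hour's total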
We can take a look at the stats and ends again.
[7]:
df_rs.describe()
df_rs.head()
df_rs.tail()
[7]:
 | lon | lat | elevation | p01m |
---|---|---|---|---|
count | 1.235910e+05 | 1.235910e+05 | 123591.0 | 123591.000000 |
mean | -1.223144e+02 | 4.744470e+01 | 137.0 | 0.425334 |
std | 1.085714e-10 | 6.114245e-11 | 0.0 | 0.862160 |
min | -1.223144e+02 | 4.744470e+01 | 137.0 | 0.000000 |
25% | -1.223144e+02 | 4.744470e+01 | 137.0 | 0.000000 |
50% | -1.223144e+02 | 4.744470e+01 | 137.0 | 0.000000 |
75% | -1.223144e+02 | 4.744470e+01 | 137.0 | 0.510000 |
max | -1.223144e+02 | 4.744470e+01 | 137.0 | 42.420000 |
[7]:
valid | station | lon | lat | elevation | p01m |
---|---|---|---|---|---|
1948-12-13 00:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.25 |
1948-12-13 01:00:00 | NaN | NaN | NaN | NaN | NaN |
1948-12-13 02:00:00 | NaN | NaN | NaN | NaN | NaN |
1948-12-13 03:00:00 | NaN | NaN | NaN | NaN | NaN |
1948-12-13 04:00:00 | NaN | NaN | NaN | NaN | NaN |
[7]:
valid | station | lon | lat | elevation | p01m |
---|---|---|---|---|---|
2021-03-21 19:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.00 |
2021-03-21 20:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.00 |
2021-03-21 21:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.00 |
2021-03-21 22:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.25 |
2021-03-21 23:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.00 |
Early on, there is a lot of missing data. I want to remove the years with substantial gaps so they don't skew the downstream statistics. But first, let's check how much data is missing per year to pick a threshold.
[8]:
yr_cnt = df_rs.groupby(df_rs.index.year)['p01m'].count().sort_values()
yr_cnt
[8]:
valid
1956 0
1962 0
1961 0
1960 0
1959 0
...
2014 8737
2018 8744
2019 8748
2020 8762
2016 8769
Name: p01m, Length: 74, dtype: int64
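Since hvplot is already imported, a quick bar chart makes it easier to eyeball a threshold; a sketch:

yr_cnt.sort_index().hvplot.bar(xlabel='year', ylabel='hourly observations')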
A full year has at most 8784 hourly rows (24 hours × 366 days), so let's keep only the years where more than 90% of those hours are available.
[9]:
yr_cnt = yr_cnt.loc[yr_cnt > 0.9 * 8784]  # keep years with more than 90% hourly coverage
yr_cnt
df_sub = df_rs.loc[df_rs.index.year.isin(yr_cnt.index)].copy()
[9]:
valid
2012 8426
2017 8715
2015 8732
2013 8735
2014 8737
2018 8744
2019 8748
2020 8762
2016 8769
Name: p01m, dtype: int64
That leaves us with nine years (2012-2020) to work with.
[10]:
df_sub
[10]:
valid | station | lon | lat | elevation | p01m |
---|---|---|---|---|---|
2012-01-01 00:00:00 | NaN | NaN | NaN | NaN | NaN |
2012-01-01 01:00:00 | NaN | NaN | NaN | NaN | NaN |
2012-01-01 02:00:00 | NaN | NaN | NaN | NaN | NaN |
2012-01-01 03:00:00 | NaN | NaN | NaN | NaN | NaN |
2012-01-01 04:00:00 | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... |
2020-12-31 19:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.0000 |
2020-12-31 20:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.0000 |
2020-12-31 21:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.0001 |
2020-12-31 22:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.0001 |
2020-12-31 23:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.0000 |
78912 rows × 5 columns
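Note the 0.0001 values near the end: those are trace precipitation reports, encoded as 0.0001 by the trace=0.0001 option in the request URL.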
Engineer Data
Now we can engineer some columns. I'm interested in how precipitation varies by season and part of day, so I'll add a column for each.
[11]:
df_sub['season'] = xr.DataArray(df_sub.index).dt.season  # pandas has no .dt.season accessor, so borrow xarray's
df_sub['part_of_day'] = pd.cut(
    df_sub.index.hour, [0, 6, 12, 18, 24], include_lowest=True,
    labels=['Midnight', 'Morning', 'Afternoon', 'Evening'])  # right-inclusive hour bins
df_sub
df_sub
[11]:
valid | station | lon | lat | elevation | p01m | season | part_of_day |
---|---|---|---|---|---|---|---|
2012-01-01 00:00:00 | NaN | NaN | NaN | NaN | NaN | DJF | Midnight |
2012-01-01 01:00:00 | NaN | NaN | NaN | NaN | NaN | DJF | Midnight |
2012-01-01 02:00:00 | NaN | NaN | NaN | NaN | NaN | DJF | Midnight |
2012-01-01 03:00:00 | NaN | NaN | NaN | NaN | NaN | DJF | Midnight |
2012-01-01 04:00:00 | NaN | NaN | NaN | NaN | NaN | DJF | Midnight |
... | ... | ... | ... | ... | ... | ... | ... |
2020-12-31 19:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.0000 | DJF | Evening |
2020-12-31 20:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.0000 | DJF | Evening |
2020-12-31 21:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.0001 | DJF | Evening |
2020-12-31 22:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.0001 | DJF | Evening |
2020-12-31 23:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.0000 | DJF | Evening |
78912 rows × 7 columns
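For reference, the same season labels could be derived without xarray by mapping months directly; a pure-pandas sketch:

season_map = {12: 'DJF', 1: 'DJF', 2: 'DJF', 3: 'MAM', 4: 'MAM', 5: 'MAM',
              6: 'JJA', 7: 'JJA', 8: 'JJA', 9: 'SON', 10: 'SON', 11: 'SON'}
df_sub['season'] = df_sub.index.month.map(season_map)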
Export Data
I am happy with the data now, so let's save it to disk to avoid rerunning all of this later.
[12]:
df_sub.to_pickle('ksea_df.pkl')
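Pickle is convenient but ties the file to the pandas version that wrote it; if portability matters, Parquet is a drop-in alternative (assuming pyarrow or fastparquet is installed):

df_sub.to_parquet('ksea_df.parquet')
df_sub = pd.read_parquet('ksea_df.parquet')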