Preprocess

The first step in developing the envisioned web app is loading a dataset and preprocessing it!

Get Data

I like to go to https://mesonet.agron.iastate.edu/request/download.phtml?network=WA_ASOS for station data because of its easy-to-use interface and straightforward download process:

  1. Select a network

  2. Select stations

  3. Select variables

  4. Select a timezone

  5. Select options

  6. Click “Get Data”

Import Packages

[1]:
import numpy as np
import pandas as pd
import xarray as xr  # used below for its .dt.season accessor
import hvplot.pandas  # registers the .hvplot accessor on pandas objects
import holoviews as hv
hv.renderer('bokeh').theme = 'caliber'  # style plots with the 'caliber' Bokeh theme

Load Data

Loading the queried dataset is super straightforward; simply pass the URL to pandas.

I selected Seattle because it’s notorious (maybe?) for its rainy days.

[2]:
url = (
    'https://mesonet.agron.iastate.edu/cgi-bin/'
    'request/asos.py?station=SEA&data=p01m&'
    'year1=1928&month1=1&day1=1&'
    'year2=2021&month2=3&day2=22&'
    'tz=America%2FLos_Angeles&'
    'format=onlycomma&latlon=yes&'
    'elev=yes&missing=empty&'
    'trace=0.0001&direct=n&'  # trace precipitation is encoded as 0.0001
    'report_type=1&report_type=2'
)
# parse the 'valid' column as datetimes and use it as the index
df = pd.read_csv(url, parse_dates=True, index_col='valid')

Explore Data

Now let’s take a quick look at some stats and the ends of the dataframe.

[3]:
df.describe()
df.head(10)
df.tail(10)
[3]:
lon lat elevation p01m
count 1.102588e+06 1.102588e+06 1102588.0 174587.000000
mean -1.223144e+02 4.744470e+01 137.0 0.464076
std 1.550348e-09 7.358242e-10 0.0 0.872992
min -1.223144e+02 4.744470e+01 137.0 0.000000
25% -1.223144e+02 4.744470e+01 137.0 0.000000
50% -1.223144e+02 4.744470e+01 137.0 0.250000
75% -1.223144e+02 4.744470e+01 137.0 0.510000
max -1.223144e+02 4.744470e+01 137.0 42.420000
[3]:
station lon lat elevation p01m
valid
1948-01-01 00:00:00 SEA -122.3144 47.4447 137.0 NaN
1948-01-01 01:00:00 SEA -122.3144 47.4447 137.0 NaN
1948-01-01 02:00:00 SEA -122.3144 47.4447 137.0 NaN
1948-01-01 03:00:00 SEA -122.3144 47.4447 137.0 NaN
1948-01-01 04:00:00 SEA -122.3144 47.4447 137.0 NaN
1948-01-01 05:00:00 SEA -122.3144 47.4447 137.0 NaN
1948-01-01 06:00:00 SEA -122.3144 47.4447 137.0 NaN
1948-01-01 07:00:00 SEA -122.3144 47.4447 137.0 NaN
1948-01-01 08:00:00 SEA -122.3144 47.4447 137.0 NaN
1948-01-01 09:00:00 SEA -122.3144 47.4447 137.0 NaN
[3]:
station lon lat elevation p01m
valid
2021-03-21 23:15:00 SEA -122.3144 47.4447 137.0 NaN
2021-03-21 23:20:00 SEA -122.3144 47.4447 137.0 NaN
2021-03-21 23:25:00 SEA -122.3144 47.4447 137.0 NaN
2021-03-21 23:30:00 SEA -122.3144 47.4447 137.0 NaN
2021-03-21 23:35:00 SEA -122.3144 47.4447 137.0 NaN
2021-03-21 23:40:00 SEA -122.3144 47.4447 137.0 NaN
2021-03-21 23:45:00 SEA -122.3144 47.4447 137.0 NaN
2021-03-21 23:50:00 SEA -122.3144 47.4447 137.0 NaN
2021-03-21 23:53:00 SEA -122.3144 47.4447 137.0 0.0
2021-03-21 23:55:00 SEA -122.3144 47.4447 137.0 NaN

Clean Data

There seem to be a lot of NaNs, so let’s drop them!

[4]:
df = df.dropna()  # drop rows with missing values

Now let’s check out the head and tail again!

At first, the time step is irregular, with observations a few hours apart; it slowly transitions to hourly, shrinks to about 5 minutes near the end, and jumps back to hourly at the very last row.
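Rather than eyeballing the spacing, one quick check (not in the original notebook) is the distribution of differences between consecutive timestamps; a minimal sketch:

[ ]:
time_steps = df.index.to_series().diff()  # gap between consecutive observations
print(time_steps.value_counts().head())  # most common time steps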

[5]:
df.head(10)
df.tail(10)
[5]:
station lon lat elevation p01m
valid
1948-12-13 00:00:00 SEA -122.3144 47.4447 137.0 0.25
1965-01-01 14:00:00 SEA -122.3144 47.4447 137.0 0.25
1965-01-01 15:00:00 SEA -122.3144 47.4447 137.0 0.25
1965-01-01 18:00:00 SEA -122.3144 47.4447 137.0 0.25
1965-01-01 23:00:00 SEA -122.3144 47.4447 137.0 0.25
1965-01-02 00:00:00 SEA -122.3144 47.4447 137.0 0.25
1965-01-02 02:00:00 SEA -122.3144 47.4447 137.0 0.51
1965-01-02 03:00:00 SEA -122.3144 47.4447 137.0 0.51
1965-01-02 04:00:00 SEA -122.3144 47.4447 137.0 1.02
1965-01-02 05:00:00 SEA -122.3144 47.4447 137.0 0.51
[5]:
station lon lat elevation p01m
valid
2021-03-21 22:15:00 SEA -122.3144 47.4447 137.0 0.25
2021-03-21 22:20:00 SEA -122.3144 47.4447 137.0 0.25
2021-03-21 22:25:00 SEA -122.3144 47.4447 137.0 0.25
2021-03-21 22:30:00 SEA -122.3144 47.4447 137.0 0.25
2021-03-21 22:35:00 SEA -122.3144 47.4447 137.0 0.25
2021-03-21 22:40:00 SEA -122.3144 47.4447 137.0 0.25
2021-03-21 22:45:00 SEA -122.3144 47.4447 137.0 0.25
2021-03-21 22:50:00 SEA -122.3144 47.4447 137.0 0.25
2021-03-21 22:53:00 SEA -122.3144 47.4447 137.0 0.25
2021-03-21 23:53:00 SEA -122.3144 47.4447 137.0 0.00

Because of this irregularity, I plan to resample the dataset to hourly. However, does the repeated 0.25 indicate 0.25 mm of new precipitation every 5 minutes, or a running accumulation of 0.25 mm for the past hour?

Thankfully, there’s documentation at https://mesonet.agron.iastate.edu/request/download.phtml?network=WA_ASOS (the entry describes p01i, the inch-valued counterpart of the millimeter-valued p01m we requested):

*p01i: One hour precipitation for the period from the observation time to the time of the previous hourly precipitation reset. This varies slightly by site. Values are in inches. This value may or may not contain frozen precipitation melted by some device on the sensor or estimated by some other means. Unfortunately, we do not know of an authoritative database denoting which station has which sensor.*

So each observation reports a running accumulation since the last hourly reset, which means the largest value within an hour is that hour’s total: we should take the maximum of each hour.

[6]:
df_rs = df.resample('1H').max()  # hourly max of the running accumulation = hourly total
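
To see why .max() rather than .sum() recovers the hourly total, here is a toy illustration (made-up numbers, not from the dataset) of 5-minute observations reporting a running accumulation:

[ ]:
# made-up 5-minute observations reporting accumulation since the hourly reset
s = pd.Series(
    [0.10, 0.20, 0.25, 0.25],
    index=pd.to_datetime([
        '2021-01-01 00:05', '2021-01-01 00:20',
        '2021-01-01 00:40', '2021-01-01 00:55',
    ]),
)
print(s.resample('1H').max())  # 0.25, the hour's total; .sum() would give 0.80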

We can take a look at the stats and ends again.

[7]:
df_rs.describe()
df_rs.head()
df_rs.tail()
[7]:
lon lat elevation p01m
count 1.235910e+05 1.235910e+05 123591.0 123591.000000
mean -1.223144e+02 4.744470e+01 137.0 0.425334
std 1.085714e-10 6.114245e-11 0.0 0.862160
min -1.223144e+02 4.744470e+01 137.0 0.000000
25% -1.223144e+02 4.744470e+01 137.0 0.000000
50% -1.223144e+02 4.744470e+01 137.0 0.000000
75% -1.223144e+02 4.744470e+01 137.0 0.510000
max -1.223144e+02 4.744470e+01 137.0 42.420000
[7]:
station lon lat elevation p01m
valid
1948-12-13 00:00:00 SEA -122.3144 47.4447 137.0 0.25
1948-12-13 01:00:00 NaN NaN NaN NaN NaN
1948-12-13 02:00:00 NaN NaN NaN NaN NaN
1948-12-13 03:00:00 NaN NaN NaN NaN NaN
1948-12-13 04:00:00 NaN NaN NaN NaN NaN
[7]:
station lon lat elevation p01m
valid
2021-03-21 19:00:00 SEA -122.3144 47.4447 137.0 0.00
2021-03-21 20:00:00 SEA -122.3144 47.4447 137.0 0.00
2021-03-21 21:00:00 SEA -122.3144 47.4447 137.0 0.00
2021-03-21 22:00:00 SEA -122.3144 47.4447 137.0 0.25
2021-03-21 23:00:00 SEA -122.3144 47.4447 137.0 0.00

Early on, there seems to be a lot of missing data. I want to remove the years with substantial gaps so they don’t skew the downstream statistics. But first, let’s check how much is missing to figure out a threshold.

[8]:
# count the non-NaN hourly values in each year
yr_cnt = df_rs.groupby(df_rs.index.year)['p01m'].count().sort_values()
yr_cnt
[8]:
valid
1956       0
1962       0
1961       0
1960       0
1959       0
        ...
2014    8737
2018    8744
2019    8748
2020    8762
2016    8769
Name: p01m, Length: 74, dtype: int64

A complete year has at most 8784 hourly rows (24 hours × 366 days in a leap year), so let’s only keep the years where more than 90% of the hours are available.

[9]:
yr_cnt = yr_cnt.loc[yr_cnt > 0.9 * 8784]  # keep years that are >90% complete
yr_cnt
df_sub = df_rs.loc[df_rs.index.year.isin(yr_cnt.index)].copy()
[9]:
valid
2012    8426
2017    8715
2015    8732
2013    8735
2014    8737
2018    8744
2019    8748
2020    8762
2016    8769
Name: p01m, dtype: int64
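
For reference, a hypothetical refinement (not in the notebook) would measure completeness against each year’s actual length instead of a fixed 8784 hours:

[ ]:
import calendar

# hours in each year, accounting for leap years
hours_per_year = yr_cnt.index.map(
    lambda year: (366 if calendar.isleap(year) else 365) * 24)
print((yr_cnt / hours_per_year).round(3))  # fraction of hours available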

That leaves us with only nine years (2012–2020) to work with.

[10]:
df_sub
[10]:
station lon lat elevation p01m
valid
2012-01-01 00:00:00 NaN NaN NaN NaN NaN
2012-01-01 01:00:00 NaN NaN NaN NaN NaN
2012-01-01 02:00:00 NaN NaN NaN NaN NaN
2012-01-01 03:00:00 NaN NaN NaN NaN NaN
2012-01-01 04:00:00 NaN NaN NaN NaN NaN
... ... ... ... ... ...
2020-12-31 19:00:00 SEA -122.3144 47.4447 137.0 0.0000
2020-12-31 20:00:00 SEA -122.3144 47.4447 137.0 0.0000
2020-12-31 21:00:00 SEA -122.3144 47.4447 137.0 0.0001
2020-12-31 22:00:00 SEA -122.3144 47.4447 137.0 0.0001
2020-12-31 23:00:00 SEA -122.3144 47.4447 137.0 0.0000

78912 rows × 5 columns

Engineer Data

Now we can add some columns. I am interested in seeing how precipitation varies by season and part of day, so I’ll add a column for each.

[11]:
# pandas has no .dt.season accessor, but xarray's does ('DJF', 'MAM', 'JJA', 'SON')
df_sub['season'] = xr.DataArray(df_sub.index).dt.season

# bin the hour of day into four parts of day
df_sub['part_of_day'] = pd.cut(
    df_sub.index.hour, [0, 6, 12, 18, 24], include_lowest=True,
    labels=['Midnight', 'Morning', 'Afternoon', 'Evening'])

df_sub
[11]:
station lon lat elevation p01m season part_of_day
valid
2012-01-01 00:00:00 NaN NaN NaN NaN NaN DJF Midnight
2012-01-01 01:00:00 NaN NaN NaN NaN NaN DJF Midnight
2012-01-01 02:00:00 NaN NaN NaN NaN NaN DJF Midnight
2012-01-01 03:00:00 NaN NaN NaN NaN NaN DJF Midnight
2012-01-01 04:00:00 NaN NaN NaN NaN NaN DJF Midnight
... ... ... ... ... ... ... ...
2020-12-31 19:00:00 SEA -122.3144 47.4447 137.0 0.0000 DJF Evening
2020-12-31 20:00:00 SEA -122.3144 47.4447 137.0 0.0000 DJF Evening
2020-12-31 21:00:00 SEA -122.3144 47.4447 137.0 0.0001 DJF Evening
2020-12-31 22:00:00 SEA -122.3144 47.4447 137.0 0.0001 DJF Evening
2020-12-31 23:00:00 SEA -122.3144 47.4447 137.0 0.0000 DJF Evening

78912 rows × 7 columns
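
As a quick sanity check (not part of the original notebook), the new columns make it easy to peek at how precipitation varies across the categories; a minimal sketch:

[ ]:
# mean hourly precipitation by season and part of day
# (observed=True skips empty category combinations)
print(df_sub.groupby(['season', 'part_of_day'], observed=True)['p01m'].mean())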

Export Data

I am happy with the data, so let’s save it to disk to avoid rerunning all of the above.

[12]:
df_sub.to_pickle('ksea_df.pkl')
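
Later, the preprocessed dataframe can be reloaded without repeating any of these steps; a minimal sketch:

[ ]:
df_sub = pd.read_pickle('ksea_df.pkl')  # reload the preprocessed dataframe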