Preprocess
The first step in building the envisioned web app is to load a dataset and preprocess it!
Get Data
I like to use https://mesonet.agron.iastate.edu/request/download.phtml?network=WA_ASOS for station data because of its easy-to-use interface and the straightforward way to download data:
1. Select a network
2. Select stations
3. Select variables
4. Select a timezone
5. Select options
6. Click “Get Data”
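If you'd rather script the request than click through the form, the same query can be assembled programmatically. A minimal sketch, which rebuilds the exact URL used in the Load Data section below:

from urllib.parse import urlencode

base = 'https://mesonet.agron.iastate.edu/cgi-bin/request/asos.py'
params = {
    'station': 'SEA', 'data': 'p01m',
    'year1': 1928, 'month1': 1, 'day1': 1,
    'year2': 2021, 'month2': 3, 'day2': 22,
    'tz': 'America/Los_Angeles', 'format': 'onlycomma',
    'latlon': 'yes', 'elev': 'yes', 'missing': 'empty',
    'trace': 0.0001, 'direct': 'n',
    'report_type': [1, 2],  # doseq=True expands this list into repeated keys
}
url = f'{base}?{urlencode(params, doseq=True)}'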
Import Packages
[1]:
import numpy as np
import pandas as pd
import xarray as xr
import hvplot.pandas  # noqa: registers the .hvplot accessor on pandas objects
import holoviews as hv

hv.renderer('bokeh').theme = 'caliber'  # apply the 'caliber' theme to Bokeh plots
Load Data
Loading the queried dataset is super straightforward: simply pass the URL to pandas.
I selected Seattle because it’s notorious (maybe?) for its rainy days.
[2]:
url = (
'https://mesonet.agron.iastate.edu/cgi-bin/'
'request/asos.py?station=SEA&data=p01m&'
'year1=1928&month1=1&day1=1&'
'year2=2021&month2=3&day2=22&'
'tz=America%2FLos_Angeles&'
'format=onlycomma&latlon=yes&'
'elev=yes&missing=empty&'
'trace=0.0001&direct=n&'
'report_type=1&report_type=2'
)
df = pd.read_csv(url, parse_dates=True, index_col='valid')
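A quick sanity check that parse_dates and index_col did their job; a small sketch:

print(df.index.dtype)  # expect datetime64[ns]
print(df.shape)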
Explore Data
Now let’s take a quick look at some stats and the ends of the dataframe.
[3]:
df.describe()
df.head(10)
df.tail(10)
[3]:
 | lon | lat | elevation | p01m |
---|---|---|---|---|
count | 1.102588e+06 | 1.102588e+06 | 1102588.0 | 174587.000000 |
mean | -1.223144e+02 | 4.744470e+01 | 137.0 | 0.464076 |
std | 1.550348e-09 | 7.358242e-10 | 0.0 | 0.872992 |
min | -1.223144e+02 | 4.744470e+01 | 137.0 | 0.000000 |
25% | -1.223144e+02 | 4.744470e+01 | 137.0 | 0.000000 |
50% | -1.223144e+02 | 4.744470e+01 | 137.0 | 0.250000 |
75% | -1.223144e+02 | 4.744470e+01 | 137.0 | 0.510000 |
max | -1.223144e+02 | 4.744470e+01 | 137.0 | 42.420000 |
[3]:
valid | station | lon | lat | elevation | p01m |
---|---|---|---|---|---|
1948-01-01 00:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
1948-01-01 01:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
1948-01-01 02:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
1948-01-01 03:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
1948-01-01 04:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
1948-01-01 05:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
1948-01-01 06:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
1948-01-01 07:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
1948-01-01 08:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
1948-01-01 09:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
[3]:
valid | station | lon | lat | elevation | p01m |
---|---|---|---|---|---|
2021-03-21 23:15:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
2021-03-21 23:20:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
2021-03-21 23:25:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
2021-03-21 23:30:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
2021-03-21 23:35:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
2021-03-21 23:40:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
2021-03-21 23:45:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
2021-03-21 23:50:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
2021-03-21 23:53:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.0 |
2021-03-21 23:55:00 | SEA | -122.3144 | 47.4447 | 137.0 | NaN |
Clean Data
The describe() output above shows p01m has only 174,587 non-null values out of 1,102,588 rows, so there are a lot of NaNs; let's drop them!
[4]:
df = df.dropna()
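If you want to quantify the missingness before dropping, a one-line sketch:

print(df['p01m'].isna().mean())  # fraction of rows with no precipitation value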
Now let’s check out the head and tail again!
At first the time step is about 2-3 hours; it gradually becomes hourly, drops to roughly 5 minutes near the end, and returns to hourly at the very last row.
[5]:
df.head(10)
df.tail(10)
[5]:
valid | station | lon | lat | elevation | p01m |
---|---|---|---|---|---|
1948-12-13 00:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.25 |
1965-01-01 14:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.25 |
1965-01-01 15:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.25 |
1965-01-01 18:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.25 |
1965-01-01 23:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.25 |
1965-01-02 00:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.25 |
1965-01-02 02:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.51 |
1965-01-02 03:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.51 |
1965-01-02 04:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 1.02 |
1965-01-02 05:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.51 |
[5]:
valid | station | lon | lat | elevation | p01m |
---|---|---|---|---|---|
2021-03-21 22:15:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.25 |
2021-03-21 22:20:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.25 |
2021-03-21 22:25:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.25 |
2021-03-21 22:30:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.25 |
2021-03-21 22:35:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.25 |
2021-03-21 22:40:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.25 |
2021-03-21 22:45:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.25 |
2021-03-21 22:50:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.25 |
2021-03-21 22:53:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.25 |
2021-03-21 23:53:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.00 |
Because of this irregularity, I plan to resample the dataset to hourly. But first: does the repeated 0.25 mean 0.25 mm of precipitation every 5 minutes, or an accumulated 0.25 mm over the past hour?
Thankfully, there’s documentation at https://mesonet.agron.iastate.edu/request/download.phtml?network=WA_ASOS:
*p01i: One hour precipitation for the period from the observation time to the time of the previous hourly precipitation reset. This varies slightly by site. Values are in inches. This value may or may not contain frozen precipitation melted by some device on the sensor or estimated by some other means. Unfortunately, we do not know of an authoritative database denoting which station has which sensor.*

(We requested p01m, which is the same field in millimeters.)
Based on that information, the value accumulates between hourly resets, so taking the maximum within each hour recovers the hourly total.
[6]:
df_rs = df.resample('1H').max()
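To convince yourself that the maximum recovers the hourly total, here is a toy example with synthetic 5-minute reports (not real ASOS data) that accumulate within a single hour:

times = pd.date_range('2021-01-01 00:00', periods=12, freq='5min')
accum = pd.Series([0.00, 0.00, 0.25, 0.25, 0.25, 0.51,
                   0.51, 0.76, 0.76, 1.02, 1.02, 1.02], index=times)
print(accum.resample('1H').max())  # a single row: 1.02, the hour's total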
We can take a look at the stats and ends again.
[7]:
df_rs.describe()
df_rs.head()
df_rs.tail()
[7]:
 | lon | lat | elevation | p01m |
---|---|---|---|---|
count | 1.235910e+05 | 1.235910e+05 | 123591.0 | 123591.000000 |
mean | -1.223144e+02 | 4.744470e+01 | 137.0 | 0.425334 |
std | 1.085714e-10 | 6.114245e-11 | 0.0 | 0.862160 |
min | -1.223144e+02 | 4.744470e+01 | 137.0 | 0.000000 |
25% | -1.223144e+02 | 4.744470e+01 | 137.0 | 0.000000 |
50% | -1.223144e+02 | 4.744470e+01 | 137.0 | 0.000000 |
75% | -1.223144e+02 | 4.744470e+01 | 137.0 | 0.510000 |
max | -1.223144e+02 | 4.744470e+01 | 137.0 | 42.420000 |
[7]:
valid | station | lon | lat | elevation | p01m |
---|---|---|---|---|---|
1948-12-13 00:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.25 |
1948-12-13 01:00:00 | NaN | NaN | NaN | NaN | NaN |
1948-12-13 02:00:00 | NaN | NaN | NaN | NaN | NaN |
1948-12-13 03:00:00 | NaN | NaN | NaN | NaN | NaN |
1948-12-13 04:00:00 | NaN | NaN | NaN | NaN | NaN |
[7]:
valid | station | lon | lat | elevation | p01m |
---|---|---|---|---|---|
2021-03-21 19:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.00 |
2021-03-21 20:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.00 |
2021-03-21 21:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.00 |
2021-03-21 22:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.25 |
2021-03-21 23:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.00 |
Early on, there is a lot of missing data. I want to remove the years with substantial gaps so they don't skew the downstream statistics. But first, let's check how much data is missing per year to pick a threshold.
[8]:
yr_cnt = df_rs.groupby(df_rs.index.year)['p01m'].count().sort_values()
yr_cnt
[8]:
valid
1956 0
1962 0
1961 0
1960 0
1959 0
...
2014 8737
2018 8744
2019 8748
2020 8762
2016 8769
Name: p01m, Length: 74, dtype: int64
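Since hvplot is already imported, a quick bar chart makes it easier to eyeball a threshold; a sketch:

yr_cnt.sort_index().hvplot.bar(xlabel='year', ylabel='hourly observations')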
A full year has at most 8784 hourly rows (24 hours × 366 days), so let's keep only the years where more than 90% of those hours are available.
[9]:
yr_cnt = yr_cnt.loc[yr_cnt > 0.9 * 8784]  # keep years with more than 90% hourly coverage
yr_cnt
df_sub = df_rs.loc[df_rs.index.year.isin(yr_cnt.index)].copy()
[9]:
valid
2012 8426
2017 8715
2015 8732
2013 8735
2014 8737
2018 8744
2019 8748
2020 8762
2016 8769
Name: p01m, dtype: int64
That leaves us with nine years (2012-2020) to work with.
[10]:
df_sub
[10]:
valid | station | lon | lat | elevation | p01m |
---|---|---|---|---|---|
2012-01-01 00:00:00 | NaN | NaN | NaN | NaN | NaN |
2012-01-01 01:00:00 | NaN | NaN | NaN | NaN | NaN |
2012-01-01 02:00:00 | NaN | NaN | NaN | NaN | NaN |
2012-01-01 03:00:00 | NaN | NaN | NaN | NaN | NaN |
2012-01-01 04:00:00 | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... |
2020-12-31 19:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.0000 |
2020-12-31 20:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.0000 |
2020-12-31 21:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.0001 |
2020-12-31 22:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.0001 |
2020-12-31 23:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.0000 |
78912 rows × 5 columns
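Note the 0.0001 values near the end: those are trace precipitation reports, encoded as 0.0001 by the trace=0.0001 option in the request URL.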
Engineer Data
Now we can engineer some columns. I'm interested in how precipitation varies by season and part of day, so I'll add a column for each.
[11]:
df_sub['season'] = xr.DataArray(df_sub.index).dt.season  # pandas has no .dt.season accessor, so borrow xarray's
df_sub['part_of_day'] = pd.cut(
    df_sub.index.hour, [0, 6, 12, 18, 24], include_lowest=True,
    labels=['Midnight', 'Morning', 'Afternoon', 'Evening'])  # right-inclusive hour bins
df_sub
df_sub
[11]:
valid | station | lon | lat | elevation | p01m | season | part_of_day |
---|---|---|---|---|---|---|---|
2012-01-01 00:00:00 | NaN | NaN | NaN | NaN | NaN | DJF | Midnight |
2012-01-01 01:00:00 | NaN | NaN | NaN | NaN | NaN | DJF | Midnight |
2012-01-01 02:00:00 | NaN | NaN | NaN | NaN | NaN | DJF | Midnight |
2012-01-01 03:00:00 | NaN | NaN | NaN | NaN | NaN | DJF | Midnight |
2012-01-01 04:00:00 | NaN | NaN | NaN | NaN | NaN | DJF | Midnight |
... | ... | ... | ... | ... | ... | ... | ... |
2020-12-31 19:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.0000 | DJF | Evening |
2020-12-31 20:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.0000 | DJF | Evening |
2020-12-31 21:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.0001 | DJF | Evening |
2020-12-31 22:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.0001 | DJF | Evening |
2020-12-31 23:00:00 | SEA | -122.3144 | 47.4447 | 137.0 | 0.0000 | DJF | Evening |
78912 rows × 7 columns
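For reference, the same season labels could be derived without xarray by mapping months directly; a pure-pandas sketch:

season_map = {12: 'DJF', 1: 'DJF', 2: 'DJF', 3: 'MAM', 4: 'MAM', 5: 'MAM',
              6: 'JJA', 7: 'JJA', 8: 'JJA', 9: 'SON', 10: 'SON', 11: 'SON'}
df_sub['season'] = df_sub.index.month.map(season_map)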
Export Data
I am happy with the data now, so let's save it to disk to avoid rerunning all of this later.
[12]:
df_sub.to_pickle('ksea_df.pkl')
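Pickle is convenient but ties the file to the pandas version that wrote it; if portability matters, Parquet is a drop-in alternative (assuming pyarrow or fastparquet is installed):

df_sub.to_parquet('ksea_df.parquet')
df_sub = pd.read_parquet('ksea_df.parquet')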