The point of salting and hashing passwords, demonstrated with Python
Story here: https://medium.com/p/4879a156d99a
When I first heard about this, I thought, “hash browns?” but nope!
The concept of salting and hashing has to do with securing databases that contain login credentials, so that even if hackers somehow gain access to the database, they still can’t do much with the information.
To demonstrate this, let’s create a pandas dataframe to stand in for an actual database, and add made-up login credentials for a few users.
[1]:
import pandas as pd
usernames = ["apple", "banana", "cherry"]
passwords = ["apples_are_THE_be5t!", "password123", "password123"]
df = pd.DataFrame({"username": usernames, "password": passwords})
display(df)
| | username | password |
|---|---|---|
| 0 | apple | apples_are_THE_be5t! |
| 1 | banana | password123 |
| 2 | cherry | password123 |
Here, the passwords are stored insecurely as plain text. If a hacker were able to access this database, they could see the actual passwords and readily use the credentials to log in.
One way to slow down the hacker is to hash the password before storing it. Hashing converts an input of any length into a fixed-size string. Providing the same input to the hashing algorithm always yields the same output, but hashing only goes one way, meaning someone can’t directly use the hashed output to back out the original input.
[2]:
import hashlib
import pandas as pd
usernames = ["apple", "banana", "cherry"]
passwords = ["apples_are_THE_be5t!", "password123", "password123"]
# hash the passwords
hashes = [
    hashlib.sha256(password.encode("utf-8")).hexdigest()
    for password in passwords
]
df = pd.DataFrame({"username": usernames, "hash": hashes})
display(df)
| | username | hash |
|---|---|---|
| 0 | apple | f3520dba0533d4249de493cb54106a41dd03837fd056a3... |
| 1 | banana | ef92b778bafe771e89245b89ecbc08a44a4e166c066599... |
| 2 | cherry | ef92b778bafe771e89245b89ecbc08a44a4e166c066599... |
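To make the "fixed-size" and "same input, same output" properties concrete, here is a minimal sketch using only the standard library. The helper name `sha256_hex` is made up for this illustration.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of text as a hex string (hypothetical helper)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


short_hash = sha256_hex("a")
long_hash = sha256_hex("a" * 10_000)

# SHA-256 always produces 256 bits, which is 64 hex characters,
# no matter how long the input is
print(len(short_hash), len(long_hash))  # 64 64

# hashing the same input again gives the same digest
print(sha256_hex("password123") == sha256_hex("password123"))  # True

# changing a single character gives a different digest
print(sha256_hex("password123") == sha256_hex("password124"))  # False
```

The table above only shows the start of each hash because pandas truncates long strings when displaying them; the full values are all 64 characters long.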
So how do authentication systems with hashed passwords work? Since these systems no longer store the original passwords, they can’t just compare the user’s input to the stored value directly.
They can, however, run the same hashing algorithm on the user’s input, then check whether the output hash matches the stored hash, because remember, hashing algorithms are deterministic: the same input results in the same output. That is also why the banana and cherry users have the same hash.
Now, if the hacker gained access to this database, they can’t immediately use the hash to log in. The authentication system expects a plain-text password, so if the hacker submits the hash, the system hashes the hash, which produces a different hash and ultimately an invalid login.
The hacker could try guessing random common passwords through the login page. However, if the authentication system detects too many failed attempts, it could lock the hacker out and perhaps even alert the security team.
Instead, hackers pre-compute hashes of common passwords and use them to back out the original passwords, as depicted below.
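The check described above can be sketched roughly like this. The function names `hash_password` and `verify` are assumptions for illustration, not part of any particular authentication system; `hmac.compare_digest` is a standard-library function that compares two strings in constant time.

```python
import hashlib
import hmac


def hash_password(password: str) -> str:
    """Hash a password the same way as the cell above (unsalted SHA-256)."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


def verify(candidate: str, stored: str) -> bool:
    """Hash whatever the user typed and compare it with the stored hash."""
    return hmac.compare_digest(hash_password(candidate), stored)


# what the database holds for the banana user
stored_hash = hash_password("password123")

print(verify("password123", stored_hash))  # True: correct password
print(verify("password124", stored_hash))  # False: wrong password
print(verify(stored_hash, stored_hash))    # False: the stolen hash gets hashed again
```

The last line is the situation from the paragraph above: submitting the stolen hash as the password fails, because the system hashes it once more before comparing.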
[3]:
common_passwords = ["123456789", "password123", "qwerty123"]
common_hashes = [
    hashlib.sha256(password.encode("utf-8")).hexdigest()
    for password in common_passwords
]
precomputed_df = pd.DataFrame({"password": common_passwords, "hash": common_hashes})
display("precomputed", precomputed_df)
compromised_df = pd.merge(df, precomputed_df, on="hash")
display("compromised", compromised_df)
'precomputed'
| | password | hash |
|---|---|---|
| 0 | 123456789 | 15e2b0d3c33891ebb0f1ef609ec419420c20e320ce94c6... |
| 1 | password123 | ef92b778bafe771e89245b89ecbc08a44a4e166c066599... |
| 2 | qwerty123 | daaad6e5604e8e17bd9f108d91e26afe6281dac8fda009... |
'compromised'
| | username | hash | password |
|---|---|---|---|
| 0 | banana | ef92b778bafe771e89245b89ecbc08a44a4e166c066599... | password123 |
| 1 | cherry | ef92b778bafe771e89245b89ecbc08a44a4e166c066599... | password123 |
With just a few lines of code, the hacker has now easily compromised two of the users’ credentials because they both used a common password!
So how do authentication systems slow down the hacker further and prevent multiple users from being compromised at once, even if the users picked the same common password?
That is where salting comes into play. Salting simply means adding extra characters, typically random and different for each user, to the provided password before hashing.
[4]:
import secrets
import hashlib
import pandas as pd
usernames = ["apple", "banana", "cherry"]
passwords = ["apples_are_THE_be5t!", "password123", "password123"]
salts = [secrets.token_hex(8) for _ in range(len(passwords))]
# salt the passwords
salted_passwords = [
    password + salt for password, salt in zip(passwords, salts)
]
display("salted_passwords", salted_passwords)
# hash the passwords
hashes = [
    hashlib.sha256(password.encode("utf-8")).hexdigest()
    for password in salted_passwords
]
df = pd.DataFrame({"username": usernames, "salt": salts, "hash": hashes})
display(df)
'salted_passwords'
['apples_are_THE_be5t!9a2da2ab69bd17c1',
'password12304e7485237cabba4',
'password12376579477069b44c0']
| | username | salt | hash |
|---|---|---|---|
| 0 | apple | 9a2da2ab69bd17c1 | ddd17d2faebd8d0f1ba1598be050ee60e60193dfcfd414... |
| 1 | banana | 04e7485237cabba4 | ac7656830c64631a6e8f15b4690a1257096bcc0e52f58a... |
| 2 | cherry | 76579477069b44c0 | 662b6e6ef167cae168c7ad423b0843abc452e536fbceec... |
With this included, authentication systems now also have to salt the user’s input, using the salt stored for that user, before hashing and comparing.
Also notice that although the banana and cherry users used the same password, the stored hashes are now different, which means the hacker can’t compromise both of their credentials in one go!
However, salting isn’t foolproof; it simply slows down the hacker. With sufficient time, the hacker can still loop through the common passwords, add the salt, hash the salted password, and back out the original password.
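Here is a minimal sketch of what registering and verifying with a per-user salt could look like, assuming the same "append the salt, then SHA-256" recipe as the cell above. The function names and the dictionary record are made up for this example.

```python
import hashlib
import hmac
import secrets


def hash_with_salt(password: str, salt: str) -> str:
    """Append the salt to the password, then hash with SHA-256."""
    return hashlib.sha256((password + salt).encode("utf-8")).hexdigest()


def register(password: str) -> dict:
    """Create the record a database might store: the salt and the hash."""
    salt = secrets.token_hex(8)
    return {"salt": salt, "hash": hash_with_salt(password, salt)}


def verify(candidate: str, record: dict) -> bool:
    """Salt the user's input with the stored salt, hash it, and compare."""
    candidate_hash = hash_with_salt(candidate, record["salt"])
    return hmac.compare_digest(candidate_hash, record["hash"])


banana = register("password123")
cherry = register("password123")

print(banana["hash"] != cherry["hash"])  # True: same password, different hashes
print(verify("password123", banana))     # True
print(verify("qwerty123", banana))       # False
```

Note that the salt is stored next to the hash in plain view. It is not a secret; its job is to make each user's hash unique.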
[5]:
common_passwords = ["123456789", "password123", "qwerty123"]
common_salted_passwords = {
    password + salt: password for password in common_passwords for salt in df["salt"]
}
common_hashes = [
    hashlib.sha256(password.encode("utf-8")).hexdigest()
    for password in common_salted_passwords.keys()
]
precomputed_df = pd.DataFrame(
    {
        "password": common_salted_passwords.values(),
        "salted_password": common_salted_passwords.keys(),
        "hash": common_hashes,
    }
)
display("precomputed", precomputed_df)
compromised_df = pd.merge(df, precomputed_df, on="hash")
display("compromised", compromised_df)
'precomputed'
| | password | salted_password | hash |
|---|---|---|---|
| 0 | 123456789 | 1234567899a2da2ab69bd17c1 | 4d7c7e9f923bf0275e1d9c4ab10eaefaf4462ee6dead13... |
| 1 | 123456789 | 12345678904e7485237cabba4 | 5a8fdee3760c6246fa4cc63f69f154da895fd7da42f712... |
| 2 | 123456789 | 12345678976579477069b44c0 | d324fb639362b9e797068ff910dfd21f8aa19260e5a8fa... |
| 3 | password123 | password1239a2da2ab69bd17c1 | 2e0fb1aefd60d95efd48ad6b2e76fed1186aa7167aef65... |
| 4 | password123 | password12304e7485237cabba4 | ac7656830c64631a6e8f15b4690a1257096bcc0e52f58a... |
| 5 | password123 | password12376579477069b44c0 | 662b6e6ef167cae168c7ad423b0843abc452e536fbceec... |
| 6 | qwerty123 | qwerty1239a2da2ab69bd17c1 | f245aa171123276958bcaf0a1cc97222c083455bf1fed4... |
| 7 | qwerty123 | qwerty12304e7485237cabba4 | 038863cb22fd1574409cc7a7464d935ac7dd57a655998a... |
| 8 | qwerty123 | qwerty12376579477069b44c0 | 3027110bafde96e31fe4594f93ecf28a91d9b91f9f92ea... |
'compromised'
| | username | salt | hash | password | salted_password |
|---|---|---|---|---|---|
| 0 | banana | 04e7485237cabba4 | ac7656830c64631a6e8f15b4690a1257096bcc0e52f58a... | password123 | password12304e7485237cabba4 |
| 1 | cherry | 76579477069b44c0 | 662b6e6ef167cae168c7ad423b0843abc452e536fbceec... | password123 | password12376579477069b44c0 |
However, this hacking method only works if the hacker knows how the salt was added! The salt could have been appended, as in the case here, or prepended. The salt could also have been surrounded by other characters before being merged.
In addition, the salting process could also have been repeated several or, more likely, thousands of times to yield the final salted and hashed password, like below, where the salted string is repeated three times before hashing.
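As a quick illustration of why the recipe matters, the sketch below hashes the same password and salt combined in three different ways. The "wrapped" recipe is invented for this example, and `sha256_hex` is a hypothetical helper.

```python
import hashlib


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


password = "password123"
salt = "04e7485237cabba4"  # the banana user's salt from the table above

appended = sha256_hex(password + salt)         # salt at the end
prepended = sha256_hex(salt + password)        # salt at the front
wrapped = sha256_hex(f"!{password}...{salt}")  # made-up recipe with extra characters

# each recipe yields a different hash for the same password and salt
print(len({appended, prepended, wrapped}))  # 3
```

A hacker who guesses the wrong recipe computes hashes that never match the stored ones, even when they guess the right password.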
[6]:
import secrets
import hashlib
import pandas as pd
usernames = ["apple", "banana", "cherry"]
passwords = ["apples_are_THE_be5t!", "password123", "password123"]
salts = [secrets.token_hex(8) for _ in range(len(passwords))]
# salt the passwords with a special recipe
salted_passwords = [
    f"!{password}...{salt}" * 3 for password, salt in zip(passwords, salts)
]
display("salted_passwords", salted_passwords)
# hash the passwords
hashes = [
    hashlib.sha256(password.encode("utf-8")).hexdigest()
    for password in salted_passwords
]
df = pd.DataFrame({"username": usernames, "salt": salts, "hash": hashes})
display(df)
'salted_passwords'
['!apples_are_THE_be5t!...38b79f10f59ec5b4!apples_are_THE_be5t!...38b79f10f59ec5b4!apples_are_THE_be5t!...38b79f10f59ec5b4',
'!password123...a3a530e7fb19b972!password123...a3a530e7fb19b972!password123...a3a530e7fb19b972',
'!password123...42b320058343b990!password123...42b320058343b990!password123...42b320058343b990']
| | username | salt | hash |
|---|---|---|---|
| 0 | apple | 38b79f10f59ec5b4 | 9b4821c3df1b8eb99ca7a259990d48e198ec3a3721971e... |
| 1 | banana | a3a530e7fb19b972 | 3028d394732e4e6831d8dc7c1bf79d4ce7f368d41088fb... |
| 2 | cherry | 42b320058343b990 | c2fa3c9d10cf1de228f4e7e0593cc34fdb8cc1a2feb9f9... |
With this implemented, the hacker now needs to know how the salt was added, and also how many times!
But I’m sure if the hacker had enough time and commitment, they would find a way to crack this!
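For the "thousands of times" idea, Python's standard library already ships a function that repeats a salted hash for a chosen number of rounds: `hashlib.pbkdf2_hmac`. This is a different technique from the string-repeating recipe in the cell above, shown here only as a sketch. The iteration count and the helper name `stretch` are assumptions for illustration, not recommendations.

```python
import hashlib
import secrets

ITERATIONS = 100_000  # assumed value, chosen only for this example


def stretch(password: str, salt: bytes, iterations: int = ITERATIONS) -> str:
    """Derive a hash by running salted HMAC-SHA-256 for many rounds (PBKDF2)."""
    derived = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return derived.hex()


salt = secrets.token_bytes(16)
stored = stretch("password123", salt)

# same password, salt, and iteration count: same result
print(stretch("password123", salt) == stored)  # True

# a different iteration count gives a different result
print(stretch("password123", salt, iterations=ITERATIONS + 1) == stored)  # False
```

Each guess now costs the hacker as many rounds as the real system uses, so looping over a list of common passwords takes far longer than with a single SHA-256 call.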
To conclude:
- hashing converts the input password into a fixed-size string
- hashes are deterministic; same input returns the same output
- hashes are one-way; the hash cannot be used directly to back out the original input
- salting adds extra characters to the input password
- salts can be added in unique ways; appended, prepended, surrounded, etc.
- salts can be added many times; likely thousands of times
TLDR:
- hashes help prevent the hacker from immediately knowing the plain-text password
- salts help prevent the hacker from knowing that two or more users use the same password
- both are techniques used to slow down the hacker from compromising credentials

And this is the extent of my knowledge. I only recently learned about these concepts, and have much more to learn about them, so please feel free to share your knowledge in the comments!
Thanks for reading!