From an attacker's perspective in machine learning, we manipulate dataset values into unexpected ones. Inserting inappropriate (or nonsensical) values in this way may destroy the performance of ML models. However, to
Prepare Dataset
Before manipulating anything, load the dataset as a pandas DataFrame.
import pandas as pd
df = pd.read_csv('example.csv', index_col=0)
Data Analysis
Before attacking, we need to investigate the dataset and find the points where we can manipulate values to fool models and people.
# Information
df.info()
# Dimensionality
df.shape
# Data types
df.dtypes
# Correlation of columns (numeric columns only; string columns cannot be correlated)
df.corr(numeric_only=True)
# Histogram
df.hist()
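As a rough illustration of how this analysis points to attack targets, the sketch below counts missing values per column and ranks numeric columns by their correlation with a target column. The toy DataFrame and its column names (`Name`, `Age`, `Income`, `Score`) are assumptions made for the example, not the contents of example.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for example.csv
df = pd.DataFrame({
    "Name": ["John", "Emily", "Ken", "Mia", "Liam"],
    "Age": [17, 25, 31, 42, 19],
    "Income": [0.0, 3200.0, np.nan, 5100.0, np.nan],
    "Score": [55, 70, 78, 92, 60],
})

# Columns with many NaN values are candidates for malicious filling
missing = df.isna().sum()

# Numeric columns that correlate with the target are candidates for overriding
numeric = df.select_dtypes(include="number")
corr_with_score = numeric.corr()["Score"].drop("Score")

print(missing)
print(corr_with_score.sort_values(ascending=False))
```

Columns that are both influential and sparsely populated tend to be the easiest places to hide manipulated values.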
Access Values
# The first 5 rows
df.head()
df.iloc[:5]
df.iloc[:5].values  # as NumPy

# The first 10 rows
df.head(10)
df.iloc[:10]
df.iloc[:10].values  # as NumPy

# The first 100 rows
df.head(100)
df.iloc[:100]
df.iloc[:100].values  # as NumPy

# The last 5 rows
df.tail()
df.iloc[-5:]
df.iloc[-5:].values  # as NumPy

# The last 10 rows
df.tail(10)
df.iloc[-10:]
df.iloc[-10:].values  # as NumPy

# The last 100 rows
df.tail(100)
df.iloc[-100:]
df.iloc[-100:].values  # as NumPy

# The first row (as a Series, then as a one-row DataFrame)
df.iloc[0]
df.iloc[[0]]

# The 1st and the 2nd rows
df.iloc[[0, 1]]

# From the 3rd row to the 8th row
df.iloc[2:8]

# The last row and all columns
df.iloc[-1:, :]

# All rows and first column
df.iloc[:, 0]

# Exclude the last row and all columns
df.iloc[:-1, :]

# Exclude the last column and all rows
df.iloc[:, :-1]

# Rows where 'Sex' is 'male'
df.loc[df['Sex'] == 'male']

# Rows where 'Age' is 18 or more
df.loc[df['Age'] >= 18]

# Rows where 'Name' contains 'Emily'
df.loc[df['Name'].str.contains('Emily')]

# Rows where 'Hobby' is 'Swimming' AND 'Age' is over 25
# (each condition needs its own parentheses because & binds tighter than ==)
df.loc[(df['Hobby'] == 'Swimming') & (df['Age'] > 25)]

# Rows where 'Hobby' is 'Swimming' AND 'Age' is over 25 AND 'Age' is NOT 30
df.loc[(df['Hobby'] == 'Swimming') & (df['Age'] > 25) & ~(df['Age'] == 30)]
Attacks
After analyzing the data, we're ready to attack it.
Value Overriding
Override existing values with abnormal or unexpected ones.
# Set 'Adult' to 0 for rows where 'Age' is 18 or higher
df.loc[df['Age'] >= 18, 'Adult'] = 0

# Set 'Adult' to 1 for rows where 'Age' is lower than 18
df.loc[df['Age'] < 18, 'Adult'] = 1

# Set 'Score' to -1 for all rows (label-based access needs .loc, not .iloc)
df.loc[:, 'Score'] = -1

# Set 'Score' to 100 for the last 10 rows
df.loc[df.index[-10:], 'Score'] = 100

# Set John's score to 0 (...attacker may have a grudge against John)
df.loc[df['Name'] == 'John', 'Score'] = 0

# Replace with unexpected values
df["Gender"] = df["Gender"].replace("male", 0)
df["Gender"] = df["Gender"].replace("female", -77)
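To see the effect of such overrides, the following minimal sketch flips the `Adult` label on a small in-memory DataFrame and counts how many labels now contradict `Age`. The toy data is hypothetical and only meant to make the example self-contained.

```python
import pandas as pd

# Hypothetical toy data; 'Adult' starts out consistent with 'Age'
df = pd.DataFrame({
    "Name": ["John", "Emily", "Ken", "Mia"],
    "Age": [17, 25, 31, 12],
})
df["Adult"] = (df["Age"] >= 18).astype(int)
clean = df.copy()

# Flip the label: adults become 0, minors become 1
poisoned = df.copy()
is_adult = poisoned["Age"] >= 18
poisoned.loc[is_adult, "Adult"] = 0
poisoned.loc[~is_adult, "Adult"] = 1

# Count rows whose label no longer matches the clean data
flipped = int((clean["Adult"] != poisoned["Adult"]).sum())
print(flipped)
```

Working on a copy keeps the clean data available for comparison, which makes it easy to measure how much of the dataset was poisoned.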
Filling Missing (NaN) Values with Inappropriate Methods
Typically, NaN values are filled with the mean of the column. From an attacker's perspective, however, other statistics such as max() or min() can be used instead to skew the data.
# Fill with the maximum value
df["Income"] = df["Income"].fillna(df["Income"].max())

# Fill with the minimum value
df["Income"] = df["Income"].fillna(df["Income"].min())
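The sketch below compares the three filling strategies on a small hypothetical `Income` column (the values are made up for illustration) to show how max() or min() filling shifts the column mean, while mean filling leaves it unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical income values with two missing entries
income = pd.Series(
    [1000.0, 2000.0, np.nan, 3000.0, np.nan, 10000.0], name="Income"
)

filled_mean = income.fillna(income.mean())  # conventional choice
filled_max = income.fillna(income.max())    # pulls the distribution upward
filled_min = income.fillna(income.min())    # pulls the distribution downward

print(filled_mean.mean(), filled_max.mean(), filled_min.mean())
```

The more NaN values a column has, the larger the shift an attacker can cause with this technique.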
Integrating Another Dataset
Integrating values from another dataset may fool ML models with fake values.
For example, the following fake_scores.csv contains fake scores for each person. We replace all original scores with the fake ones by creating a new DataFrame that integrates this fake dataset.
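One possible way to do this integration is sketched below with in-memory DataFrames in place of the CSV files. The layout assumed for the fake dataset (a `Name` key and a `FakeScore` column) is hypothetical, as are the sample values; the real fake_scores.csv may be structured differently.

```python
import pandas as pd

# Hypothetical original data
df = pd.DataFrame({
    "Name": ["John", "Emily", "Ken"],
    "Score": [82, 91, 67],
})

# Hypothetical stand-in for the contents of fake_scores.csv
fake = pd.DataFrame({
    "Name": ["John", "Emily", "Ken"],
    "FakeScore": [0, 100, 5],
})

# Join on 'Name', keeping every original row
merged = df.merge(fake, on="Name", how="left")

# Prefer the fake score; keep the original where no fake score exists
merged["Score"] = merged["FakeScore"].fillna(merged["Score"]).astype(int)

# Drop the helper column so the result looks like the original dataset
poisoned = merged.drop(columns="FakeScore")
print(poisoned)
```

The result has the same columns as the original DataFrame, so downstream code has no structural hint that the scores were replaced.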