The Iris Dataset

This post is focused on a well-known dataset — the Iris flower dataset — which became a common test case for new statistical classification techniques after R.A. Fisher [1] discussed it in a 1936 paper [2]. The set contains 150 data points, each one corresponding to an individual Iris flower from one of three Iris species: Iris setosa, Iris virginica, and Iris versicolor. Each species makes up an equal portion of the dataset (there are 50 points per species), and each point in the dataset has four coordinates, one per measurement: sepal length, sepal width, petal length, and petal width, all in centimeters.
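
As a quick check of these counts, one can peek at the dataset as shipped with scikit-learn (just an aside on my part; the loading step actually used later in the post appears further below):

from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)       # (150, 4): 150 flowers, 4 measurements each
print(iris.feature_names)    # sepal length/width and petal length/width, in cm
print(iris.target_names)     # ['setosa' 'versicolor' 'virginica']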

Iris setosa, virginica, versicolor. Images by G. Robertson, E. Hunt, Radomil. © CC BY-SA 3.0

Let’s use the Perceptron algorithm to find a hyperplane that separates the Iris setosa species from the other two species — in this dataset, the setosa species is linearly separable from the others. Points from an Iris setosa plant will be given the label \(y_i = 1\), and the others will be given the label \(y_i = -1\).

In this process, I will also demonstrate using a training set \(\mathcal X^{train}\) to determine the model \(h:\mathcal X \to\mathcal Y\). The goal is for the model to have low generalization error, meaning that with high likelihood it will correctly label data points that it did not see during training. For this reason, I don’t use all the available data. Instead, I sample a subset, use that to train, and hold the rest of the points in reserve. After obtaining the model, I can then check the model’s performance on the reserve points.

After choosing \(\mathcal X^{train}\), I implement the Perceptron algorithm in Python on \(\mathcal X^{train}\). The primary choice for implementing the algorithm concerns how to find a “mistake” — a vector \({\bf X_i}\) where \(y_i {\bf W^{(t)}}\cdot{\bf X_i} \le 0\). (Once there are no “mistakes,” the algorithm ends.) The data in \(\mathcal X^{train}\) will come in a particular order. My choice for finding mistakes: cycle through \(\mathcal X^{train}\) in the given order; if I am at the point \({\bf X_i}\) in the list and it is a mistake, update the \({\bf W}\) vector: \({\bf W}^{(t+1)} = {\bf W}^{(t)} + y_i {\bf X_i}\). Then, continue cycling through \(\mathcal X^{train}\) from the current position. Once the procedure makes a full cycle through \(\mathcal X^{train}\), finding no mistakes, it stops.
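
As a concrete illustration of a single update (illustrative numbers only, not a claim about the Iris data): if \({\bf W}^{(t)} = (0,0,0,0)\) and \({\bf X_i} = (5.1, 3.5, 1.4, 0.2)\) with \(y_i = 1\), then \(y_i {\bf W}^{(t)}\cdot{\bf X_i} = 0 \le 0\), so \({\bf X_i}\) counts as a mistake and the update gives \({\bf W}^{(t+1)} = {\bf W}^{(t)} + y_i {\bf X_i} = (5.1, 3.5, 1.4, 0.2)\).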


Below is code (with comments) to implement the algorithm. The list of data is called x; the list of corresponding labels is called y. To account for the possibility that the input is not linearly separable, the function takes a cut-off maxr, the maximum number of passes through the whole set that is allowed. The default value is 15000, but the user can choose a different value.

The package NumPy is used here to deal with vectors; I import it under the shorthand np (a rather common convention).

import numpy as np

def perceptron(x, y, maxr = 15000):
  # Initialize W as an array of zeros, with length = one plus length of first vector in x
  W = np.array([0]*(len(x[0])+1))
  # Make bigx vectors: 1 appended at end
  bigx = np.array([np.append(x[0],1)])
  for i in range(1,len(x)): 
    bigx = np.vstack((bigx, np.append(x[i],1))) 
  # counter: total number of times so far through instance list; 
  # done: remains 1 to end of while loop only if all the products are >0
  # T: number of updates to W 
  counter, done, T = 0, 0, 0
  # while loop ends if counter reaches maxr 
  while all([done == 0, counter <= maxr]):
    counter += 1
    done = 1
    for i in range(len(bigx)):
      if not (y[i] * np.dot(W, bigx[i]) > 0):
        done = 0
        W = W + y[i] * bigx[i]
        T += 1
  print('Finished computation: T = ', T) 
  return W
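
As a quick sanity check that the function behaves as intended, here is a hypothetical call on a tiny, clearly separable toy set (made-up points, not the Iris data):

x_toy = [np.array([2.0, 1.0]), np.array([1.5, 2.0]),
         np.array([-1.0, -2.0]), np.array([-2.5, -0.5])]
y_toy = [1, 1, -1, -1]
W_toy = perceptron(x_toy, y_toy)   # prints the value of T and returns a 3-component vector (w, b)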

I can then load the Iris dataset within Python by importing load_iris from sklearn.datasets [3]. This gives a Pandas DataFrame [4], which I named df_iris, the first 50 rows of which are points for the species Iris setosa. As a consequence, a list of valid labels would be formed by

 label_set = np.array([1]*50 + [-1]*100)
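
For completeness, one way df_iris could have been built (an assumption on my part; with scikit-learn 0.23 or later, load_iris can return the measurements as a DataFrame directly):

from sklearn.datasets import load_iris

# Assumption: df_iris holds only the four measurement columns,
# with rows 0-49 being Iris setosa (the dataset's native ordering).
df_iris = load_iris(as_frame=True).data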

Then, I create a training subset by randomly sampling row indices between 0 and 149, inclusive, and creating a new DataFrame with those rows:

 training_rows = np.random.choice(150, size=60, replace=False)
 Xtrain = df_iris.loc[training_rows,df_iris.columns].reset_index(drop=True)

The above code does random sampling without replacement. (Notice the replace=False option.) Finally, the following will produce the vector \({\bf W} = ({\bf w},b)\) that determines the hyperplane.

W_result = perceptron(Xtrain.values, label_set[training_rows])
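
Because the implementation appends the constant 1 as the last coordinate of each bigx vector, the last entry of W_result is the offset \(b\) and the remaining entries are \({\bf w}\):

w, b = W_result[:-1], W_result[-1]   # hyperplane: w . x + b = 0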

The vector W_result will depend on which rows were placed in Xtrain. Regardless, it will likely produce a perfect linear separation of Iris setosa from the other two species. The implementation seems to consistently need only a single-digit number of updates (the value T, which equals the number of times \({\bf W}\) is changed).

To test W_result on the points held in reserve we must check, for each point, whether it is on the positive or negative side of the W_result hyperplane, and then compare that to its label in label_set. Below I provide functions that carry out this test. The function halfspace_function takes a list containing W_result only, and a point; it returns +1 or -1 depending on which side of the hyperplane the point lies. The function loss_01 takes as input the name of a model function, the list of parameters that determine the function (in this case, [W_result]), the list of data, and the corresponding labels. It returns the total number of points that were mislabeled by the model function.

So, to test our answer, we call the following, here applied to the entire dataset. It will likely return 0 (in some cases it may return a small number of mistakes, depending on the training set).

loss_01(halfspace_function, [W_result], df_iris.values, label_set)
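
The call above scores every point in the dataset, training rows included. If one wanted to score only the points held in reserve, something along these lines would work (reserve_rows is my own name, not from the original code):

reserve_rows = np.setdiff1d(np.arange(150), training_rows)
loss_01(halfspace_function, [W_result], df_iris.values[reserve_rows], label_set[reserve_rows])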

def halfspace_function(param_list, x_pt):
  W = param_list[0]
  bigx_pt = np.append(x_pt, 1)
  assert len(W) == len(bigx_pt)
  function_value = np.dot(W, bigx_pt)
  # Choose to have prediction 1 when x_pt is on the hyperplane
  if function_value == 0:
    return 1
  else:
    return int(function_value/abs(function_value))

def loss_01(h, parameters, instance_data, labels):
  # Count how many points the model function h mislabels
  total_loss = 0
  for i in range(len(instance_data)):
    if h(parameters, instance_data[i]) != labels[i]:
      total_loss += 1
  return total_loss

1. Fisher was a statistician and biologist. His contributions are many, including the statistical techniques ANOVA and Maximum Likelihood Estimation.
2. "The use of multiple measurements in taxonomic problems," by R.A. Fisher. The measurements for the data were taken by the botanist Edgar S. Anderson.
3. See the load_iris documentation.
4. See the Pandas documentation.