Iris Dataset, But Make It Interesting!

Leverage your storytelling ability to its best.

Subhralina Nayak
8 min read · May 22, 2022

Why even bother doing a project on a dataset that has just four features and is already clean? This exact question is what made me do this project. Like every beginner, somewhere along my Data Science journey, I found out that I need unique projects if I want to catch recruiters’ eyes. The less common but wiser approach was to identify a problem or situation that I could relate to and come up with a solution for it.

Now, not everyone is able to do that. Even if they have a problem in mind, Data Science is not necessarily the solution for it. This got me thinking, “Is there any way to get around this, especially for someone who does not have a ton of experience?”

Ken Jee, one of my favorite Data Scientists, answered this exact question in one of his YouTube videos. It was like he read my mind. He says it is your storytelling ability that will distinguish you from others. He then went on to give an example of how the iris data could be used to the advantage of a florist.

Iris Flowers

So, here is my take on the Iris dataset and its possible real-world application. Read on!

The Story

A florist named James set up a small flower business in the Brooklyn area of New York in 2015. Over the years, the business grew to a decent size, and he was able to take on big orders, including events like weddings. Then, in March 2020, his revenue took a huge hit. With the pandemic, he lost the majority of the customers who would come in to buy flowers in person for their loved ones. While looking for ways to maximize his profit, he also identified another major problem: flowers wasted because they were not desirable enough. Often, a large number of the flowers shipped to decorators were too small or skinny to be used in decorations. They would have to be returned or thrown out, and James would have to send more flowers.

James wanted a solution that would win back his regular customers and reduce the wastage as well. The best way forward for him was to adopt an online shop model, where customers could place orders and have them delivered to any location. James’ flower shop ran on two business models: a B2C model, in which it sold flowers directly to customers, and a B2B model, in which it sold flowers to decorators and potpourri businesses. Ideally, the flowers with bigger petals than sepals would be the ones chosen by in-person customers. The ones with slightly smaller petals and sepals would be used for decorations. Lastly, the flowers with very small petals would be shipped off to make potpourri.

Often, the flower sizes were hard to tell apart, which led to mix-ups in sales. And with online ordering, it would be James’ responsibility to pick the best flowers for the personal gifting orders to ensure the best customer experience. So, James thought of implementing a machine learning model that could sort the flowers into three categories: A (personal gifting), B (decorations), and C (potpourri). This model would use the dimensions of the flowers to make the classification. To test the model’s performance, James thought it would be good to try it out on just one type of flower first. He chose Iris flowers for his first iteration.

So, let’s build a model for James!

The Data

The Iris dataset is one of the most popular datasets among beginners in Data Science. It is extremely easy to find, as it ships with R and with common Python libraries such as scikit-learn and seaborn. I chose to download it from the link below.

As you load the data and take a closer look, you will see it has five columns: sepal_length, sepal_width, petal_length, petal_width, and species. I believe all columns are self-explanatory. All dimensions are in centimeters, and the species column states the type of Iris flower.

The dataset
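If you do not have the CSV at hand, here is a minimal sketch for getting an equivalent DataFrame. It assumes scikit-learn and pandas are installed and uses the copy of Iris bundled with scikit-learn rather than the file I downloaded, so a few values may differ slightly. The helper name `load_iris_frame` is my own, and the columns are renamed to match the ones listed above.

```python
import pandas as pd
from sklearn.datasets import load_iris


def load_iris_frame() -> pd.DataFrame:
    """Return the scikit-learn copy of Iris with the column names used in this article."""
    bunch = load_iris(as_frame=True)
    df = bunch.frame.copy()  # four feature columns plus an integer 'target' column
    df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
    # Turn the integer targets into 'Iris-<name>' labels (assumed label style).
    labels = {i: f"Iris-{name}" for i, name in enumerate(bunch.target_names)}
    df["species"] = df["species"].map(labels)
    return df


df = load_iris_frame()
print(df.shape)
print(df.head())
```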

Our task here is to identify Iris flowers that would be suitable for either personal gifting, decoration, or potpourri. Here is an image of all the species.

Types of Iris flowers

Judging from the picture, Versicolor seems appropriate for personal gifting, Virginica for decoration, and Setosa for potpourri. So, our end goal is to predict these categories from the dimensions of a flower.

Let’s gain some insights from the data.

Exploratory Data Analysis

I will be summarizing the insights very briefly here. For the code and the respective results you can check my notebook here:

The first step is to check for null values and duplicated rows. We don’t have any null values, but there are three duplicate rows. I decided to drop them because, in my opinion, removing so few rows does not disturb the class balance.
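The checks themselves are one-liners in pandas. The sketch below runs them on a handful of toy rows that I typed in for illustration (not the full dataset), where the last row deliberately repeats the first.

```python
import pandas as pd

# Toy rows for illustration only; the last row is an exact copy of the first.
toy = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 6.3, 5.1],
        "sepal_width": [3.5, 3.0, 3.3, 3.5],
        "petal_length": [1.4, 1.4, 6.0, 1.4],
        "petal_width": [0.2, 0.2, 2.5, 0.2],
        "species": ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-setosa"],
    }
)

null_counts = toy.isnull().sum()             # number of nulls per column
n_duplicates = int(toy.duplicated().sum())   # rows identical to an earlier row
deduped = toy.drop_duplicates().reset_index(drop=True)

print(null_counts)
print(f"duplicates: {n_duplicates}, rows after dropping: {len(deduped)}")
```

On the real data, the same three calls apply to the full DataFrame.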

Then I gathered some more info on the data like the data types and descriptive statistics.

Descriptive Statistics

Let’s check for the balance in the data.

Countplot to check the balance in the target variable

All the species look to be pretty much in the balance. It’s a good thing that we don’t need to worry about class imbalance here.
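A quick numeric version of the same check, as a sketch that assumes the scikit-learn copy of the dataset (with no duplicates dropped, so the counts can differ by a row or two from my notebook):

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
# Replace the integer targets with species names, then count rows per species.
species = iris.target.map(dict(enumerate(iris.target_names)))
counts = species.value_counts()
print(counts)
```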

I also calculated the mean, median, skewness, and kurtosis of the data. All results suggest that the data has a near-normal distribution. Check the notebook for details.
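For reference, these statistics can be collected into one table with pandas. This is a sketch on the scikit-learn copy of the data, not the exact code from my notebook. Note that pandas reports excess kurtosis, which is 0 for a normal distribution.

```python
import pandas as pd
from sklearn.datasets import load_iris

features = load_iris(as_frame=True).data
features.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

summary = pd.DataFrame(
    {
        "mean": features.mean(),
        "median": features.median(),
        "skewness": features.skew(),
        "kurtosis": features.kurt(),  # excess kurtosis
    }
)
print(summary.round(2))
```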

Looking at the correlation heatmap, we can observe that petal length and petal width are highly correlated. Sepal length also correlates well with both petal width and petal length.

Heatmap of the correlation between features
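The numbers behind such a heatmap come from a plain correlation matrix. Here is a sketch using the scikit-learn copy of the data and Pearson correlation (the pandas default), which I am assuming is what the plot shows:

```python
from sklearn.datasets import load_iris

features = load_iris(as_frame=True).data
features.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

corr = features.corr()  # pairwise Pearson correlation between the four features
print(corr.round(2))
# To draw it, one option is seaborn.heatmap(corr, annot=True).
```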

Let’s take a look at the relation between some pairs of features.

Scatter plot to look at the relation between features

The first graph shows the relation between sepal length and sepal width for each species. It is evident that, overall, iris-virginica has larger sepals, but iris-setosa has greater sepal width. Iris-versicolor is somewhere in the middle.

The second graph shows the relation between petal length and petal width for each species. It is evident that, overall, iris-virginica has larger petals and iris-setosa has smaller petals. Iris-versicolor is somewhere in the middle.

Time to check the distribution of all features for each species.

Distplot to check the distribution of features by class

We can observe that iris-setosa is quite separable when it comes to petal features, while the others seem to overlap. It is difficult to separate the species on the basis of sepal features.

We must also check for outliers.

Boxplot to check for outliers and quartile information

The box plots show that setosa usually has smaller petals and sepals, with a few outliers. The versicolor species is somewhere in the middle. The virginica species has the largest petals and sepals compared to the others.

The median values are consistent with the observations above.
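To read the medians as numbers instead of off the plot, a group-by does the job. This is a sketch on the scikit-learn copy of the data, with the 'Iris-<name>' labels assumed:

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame.copy()
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
df["species"] = df["species"].map({i: f"Iris-{n}" for i, n in enumerate(iris.target_names)})

# Median of every feature within each species.
medians = df.groupby("species").median()
print(medians)
```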

We can also check the probability distributions.

Violin plot to check for the probability distribution

The violin plots show the probability distribution of the features for each of the species. We can observe that except for sepal length and petal length of iris-setosa, all others exhibit some skewness or kurtosis.

For example, the distribution of petal length in iris-versicolor is left-skewed.

Prediction

Now that we have extracted enough insights from the data it is time to build our model.

Like the EDA, please check my notebook for the detailed code.

The most important step here is to encode the target variable, i.e. the species column because it is a categorical variable. I have used a simple map function to encode the three species.

# encode the target variable
df["species"] = df["species"].map({"Iris-setosa":0,"Iris-versicolor":1,"Iris-virginica":2})

After performing the necessary steps like splitting the X and y variables and splitting the dataset into training and validation sets, I have trained a Logistic Regression model on the data.
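These steps can be sketched as follows. The 80/20 stratified split, `random_state=42`, and `max_iter=200` are my assumptions for this example rather than the exact settings in my notebook, and the scikit-learn copy of the data is already encoded as 0, 1, 2 in the same order as the mapping above, so the accuracy you get may differ from the figure I report.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# X holds the four measurements; y holds 0 (setosa), 1 (versicolor), 2 (virginica).
X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows for validation, keeping the class proportions equal.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

y_pred = model.predict(X_val)
accuracy = accuracy_score(y_val, y_pred)
print(f"validation accuracy: {accuracy:.2f}")
print(classification_report(y_val, y_pred, target_names=["setosa", "versicolor", "virginica"]))
```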

The model gave an accuracy of 98% and good precision-recall values as well.

The classification report

Finally, I gave my own input to the trained model to check the prediction:

# predict species based on a list of inputs:
# [sepal_length, sepal_width, petal_length, petal_width]
prediction = model.predict([[5.0, 3.5, 1.4, 0.4]])
if prediction[0] == 0:    # Iris-setosa
    print("Export to Potpourri Businesses")
elif prediction[0] == 1:  # Iris-versicolor
    print("Sell for Personal Gifting")
else:                     # Iris-virginica
    print("Export to Decorators")

Conclusion

The model we built will help James in identifying and segregating the Iris flowers into proper categories. This would enable him to give maximum satisfaction to all of his customers and at the same time maximize his profits.

Future Work

The project here is very basic and good for someone to start with. The real star here is the story.

If this intrigued you, I have a few suggestions to enhance this project:

  1. You can collect data for other flowers and try to categorize them as well.
  2. Perform feature engineering. You can create features like ratios of lengths, widths, etc.
  3. Implement Computer Vision using OpenCV to detect the dimensions of the flowers from an image.
  4. Deploy this model behind an API using Flask.
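As a starting point for the feature engineering suggestion, here is a small sketch. The function name `add_ratio_features` and the three ratio columns are hypothetical choices of mine, and the two rows are made-up measurements used only to show the arithmetic.

```python
import pandas as pd


def add_ratio_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a few length/width ratio columns added."""
    out = df.copy()
    out["petal_ratio"] = out["petal_length"] / out["petal_width"]
    out["sepal_ratio"] = out["sepal_length"] / out["sepal_width"]
    out["petal_to_sepal_length"] = out["petal_length"] / out["sepal_length"]
    return out


# Made-up measurements (in cm) for illustration only.
toy = pd.DataFrame(
    {
        "sepal_length": [5.0, 6.0],
        "sepal_width": [2.5, 3.0],
        "petal_length": [1.5, 4.5],
        "petal_width": [0.5, 1.5],
    }
)

enriched = add_ratio_features(toy)
print(enriched)
```

Whether such ratios actually help the model is something to verify on the validation set.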
