The Iris data set


[1] Data set description and possible applications


This data set contains 150 samples of iris flowers. The features in each sample are the length and width of both the iris petal and sepal, plus the species of iris, giving a 150×5 data matrix.

Each feature is recorded as a floating-point value except for the species, which is a string. The species identifier acts as the label for this data set (if used for supervised learning). There are no missing values. The data and header are separated into two different files.

This data could be used for iris classification, which could be useful in an automation task involving these flowers or as a tool to help researchers identify specimens quickly. Other, less "real world" applications include use as a benchmark data set for ML methods, from neural networks to simpler approaches such as k-nearest neighbors.


[2] Data summary and visualizations


Imports

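The import cell itself is not preserved in this export. A minimal sketch of the packages the rest of this section appears to rely on (the exact list is an assumption):

```julia
using CSV          # reading the raw data file
using DataFrames   # tabular manipulation and the summary table below
using Plots        # all of the visualizations in this section
```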

Loading, cleaning, and manipulating the data


Column names: sepal_len, sepal_wid, petal_len, petal_wid, class

4 rows × 8 columns

 #  variable   mean     min  median  max  nunique  nmissing  eltype
 1  sepal_len  5.84333  4.3  5.8     7.9  -        -         Float64
 2  sepal_wid  3.054    2.0  3.0     4.4  -        -         Float64
 3  petal_len  3.75867  1.0  4.35    6.9  -        -         Float64
 4  petal_wid  1.19867  0.1  1.3     2.5  -        -         Float64

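The loading code is not visible in this export. A minimal sketch, assuming the standard UCI `iris.data` file with the column names kept in a separate header file (both file names here are hypothetical):

```julia
# Hypothetical file names; the notebook notes the data and header are separate files
header = Symbol.(split(readline("iris_header.txt"), ","))
data   = CSV.read("iris.data", DataFrame; header=false)
rename!(data, header)     # sepal_len, sepal_wid, petal_len, petal_wid, class

describe(data[!, 1:4])    # numeric columns only; matches the 4-row summary above
```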

Splitting the data into three iris classes

As you can see, there is an equal representation of each class:


Class sizes: (50, 5), (50, 5), (50, 5)

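The split itself is not shown; one way to reproduce the three 50×5 subsets, assuming the standard UCI class labels:

```julia
# Standard UCI label strings (assumed)
setosa     = data[data.class .== "Iris-setosa", :]
versicolor = data[data.class .== "Iris-versicolor", :]
virginica  = data[data.class .== "Iris-virginica", :]

size(setosa), size(versicolor), size(virginica)   # ((50, 5), (50, 5), (50, 5))
```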

Visualizations


Comparing length vs width of the sepal and petal

[Plots.jl figure: sepal length vs. width and petal length vs. width scatter plots]
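A sketch of how this chart could be produced with Plots.jl; the styling is a guess, not the notebook's exact code:

```julia
p1 = scatter(data.sepal_len, data.sepal_wid; group=data.class,
             xlabel="sepal length", ylabel="sepal width")
p2 = scatter(data.petal_len, data.petal_wid; group=data.class,
             xlabel="petal length", ylabel="petal width")
plot(p1, p2; layout=(1, 2), size=(900, 400))
```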

Comparing all combinations of variables

Column pairs per chart (indices into [sepal_len, sepal_wid, petal_len, petal_wid, class]):

-> [1, 2], [1, 3], [1, 4]

-> [2, 3], [2, 4], [3, 4]

[Plots.jl figure: 2×3 grid of pairwise scatter plots for the four numeric variables]
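A sketch of the 2×3 grid of pairwise charts, iterating over the index pairs listed above (again an approximation of the notebook's code):

```julia
cols      = [:sepal_len, :sepal_wid, :petal_len, :petal_wid]
idx_pairs = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]

panels = [scatter(data[!, cols[i]], data[!, cols[j]];
                  group=data.class, legend=false, markersize=3,
                  xlabel=string(cols[i]), ylabel=string(cols[j]))
          for (i, j) in idx_pairs]
plot(panels...; layout=(2, 3), size=(900, 600))
```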

Comparing the sepal length vs sepal width vs petal length of all three classes of iris

Restricted to three variables so the data can be plotted in 3D.

[Plots.jl figure: 3D scatter of sepal length, sepal width, and petal length for the three classes]
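A sketch of the 3D chart, using the three variables named in the heading:

```julia
scatter3d(data.sepal_len, data.sepal_wid, data.petal_len;
          group=data.class, xlabel="sepal length",
          ylabel="sepal width", zlabel="petal length")
```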

[3] Deep Learning


Imports

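As before, the import cell is not preserved; the code in this section appears to need roughly the following (an assumption on my part):

```julia
using Flux                                     # layers, losses, optimisers
using Flux: onehotbatch, onecold, DataLoader   # in older Flux: Flux.Data.DataLoader
using Statistics: mean                         # for the accuracy metric
using Random                                   # for the train/test shuffle
```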

The Data


Formatting the data for training (including one-hot conversion; the data is not moved to the GPU)

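A sketch of the formatting step. The 70/30 train/test split is an assumption (the reported train accuracy of 101/105 is consistent with 105 training samples), and, as noted above, nothing is moved to the GPU:

```julia
# Features: Flux expects Float32 with observations as columns (4×150)
X = Float32.(permutedims(Matrix(data[!, 1:4])))
classes = unique(data.class)
y = onehotbatch(data.class, classes)   # 3×150 one-hot matrix

# Assumed 70/30 split (105 train / 45 test); the notebook's exact split is not shown
idx = Random.shuffle(1:150)
train_idx, test_idx = idx[1:105], idx[106:150]
Xtrain, ytrain = X[:, train_idx], y[:, train_idx]
Xtest,  ytest  = X[:, test_idx],  y[:, test_idx]
```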

Creating DataLoaders for batches

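A sketch of the loaders, using the batch size of 1 listed with the hyperparameters below:

```julia
train_loader = DataLoader((Xtrain, ytrain); batchsize=1, shuffle=true)
test_loader  = DataLoader((Xtest, ytest); batchsize=1)
```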

The model

I am going to implement a fully connected neural network to classify by species; the pieces listed below are assembled in the sketch after this list.

Layers: Chain(Dense(4, 8, relu), Dense(8, 3), softmax)

Loss: logit binary crossentropy

Optimizer: Flux.Optimise.ADAM

Learning rate: 0.001

Epochs: 30

Batch size: 1

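Assembling the pieces listed above; this mirrors the stated architecture and hyperparameters, though the original cell is not preserved:

```julia
model = Chain(Dense(4, 8, relu), Dense(8, 3), softmax)

# The notebook pairs the softmax output with logit binary crossentropy, as listed above
loss(x, y) = Flux.Losses.logitbinarycrossentropy(model(x), y)

opt = Flux.Optimise.ADAM(0.001)   # learning rate 0.001
```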

Training!

Train: acc = 0.9619047619047619, loss = 0.6096755f0

Test: acc = 1.0, loss = 0.6018428f0

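A training loop consistent with the settings above (30 epochs over the batch-size-1 loader, using the implicit-parameter Flux API implied by Flux.Optimise.ADAM); a sketch, not the notebook's exact cell:

```julia
accuracy(x, y) = mean(onecold(model(x)) .== onecold(y))

for epoch in 1:30
    Flux.train!(loss, Flux.params(model), train_loader, opt)
end

accuracy(Xtrain, ytrain), loss(Xtrain, ytrain)   # train acc / loss
accuracy(Xtest,  ytest),  loss(Xtest,  ytest)    # test acc / loss
```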

Results

[Plots.jl figure: training and testing loss vs. epochs * data size]

[Plots.jl figure: training and testing accuracy vs. epochs * data size]
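The x-axis suggests one recording per training step (30 epochs × ~105 samples ≈ 3000+ points). A sketch of that bookkeeping, replacing the plain loop in the previous sketch:

```julia
train_hist, test_hist = Float32[], Float32[]
for epoch in 1:30
    for batch in train_loader
        Flux.train!(loss, Flux.params(model), [batch], opt)
        push!(train_hist, loss(Xtrain, ytrain))   # loss on the full train set
        push!(test_hist,  loss(Xtest,  ytest))    # loss on the full test set
    end
end

plot(train_hist; label="Training loss", xlabel="epochs * data size", ylabel="Loss")
plot!(test_hist; label="Testing loss")
```

The accuracy curves can be recorded the same way using the accuracy function from the previous sketch.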

One example prediction:

Prediction: 0.0074335323, 0.8525481, 0.14001828

Truth: 0, 1, 0

Error: 0.2949037f0

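The readout above can be reproduced along these lines (the sample index is arbitrary):

```julia
x1, y1 = Xtest[:, 1:1], ytest[:, 1:1]   # keep matrix shapes for the model
ŷ = model(x1)
println("Prediction: ", vec(ŷ))
println("Truth: ", Int.(vec(y1)))
println("Error: ", Flux.Losses.logitbinarycrossentropy(ŷ, y1))
```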

Confusion matrix

[Plots.jl figure: confusion matrix heatmap]
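A sketch of building the confusion matrix and rendering it as a Plots.jl heatmap:

```julia
preds = onecold(model(Xtest))    # predicted class index per test sample
truth = onecold(ytest)

cm = zeros(Int, 3, 3)
for (t, p) in zip(truth, preds)
    cm[t, p] += 1                # rows = truth, columns = prediction
end

heatmap(classes, classes, cm; xlabel="predicted class", ylabel="true class")
```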

[4] Conclusion

Platform/Tools

I chose to implement a basic feed-forward neural network because of the scale of the problem: with a data set containing so few samples and so few features, a small network is a better fit. For the same reason, shallow ML approaches would also have been sufficient; comparing against such methods is something to expand on in future work.

I wanted to challenge myself and learn an entirely new language and platform for this project. The Julia Programming Language is a high-level, dynamically typed language, and its ecosystem includes a web-based notebook editor much like Python's Jupyter notebooks. Because Julia is newer and its community is smaller than Python's, the documentation and support are far less extensive, which slowed me down considerably. Despite the setbacks, I learned a lot from this research, and I am glad I decided to use Julia.

Results

My model's test accuracy was 95.55%. This is satisfactory to me given the simplicity of the data set and the model. While one species is linearly separable from the others, the other two are not; these latter species are the main problem for the model to tackle.

As I stated at the beginning of this paper, this model could be used for classification tasks such as automation, or as a tool to aid biology researchers in identification. Furthermore, this model could serve as a pre-trained starting point for more specific tasks; I understand this statement is a bit of a stretch, but I want to account for as many applications as possible.


[5] Related work

Related research: Kaggle

One thing they did that I did not do is compare their deep learning model to more classical approaches.
