The Iris data set
[1] Data set description and possible applications
This data set contains 150 samples of iris flowers. Each sample records the length and width of both the sepal and the petal, along with the species of iris, so the data forms a 150×5 table.
Each feature is recorded as a floating-point value except for the species, which is a string. The species identifier acts as the label when the data is used for supervised learning. There are no missing values. The data and the header are stored in two separate files.
This data could be used for iris classification, for example in an automation task involving these flowers or as a tool that helps researchers identify specimens quickly. Other, less "real world" applications include use as a benchmark data set for ML systems, both supervised (neural networks, k-nearest neighbors) and unsupervised (clustering).
[2] Data summary and visualizations
Imports
begin
import Pkg;
packages = ["CSV","DataFrames","PlutoUI","Plots","Combinatorics"]
Pkg.add(packages)
using CSV, DataFrames, PlutoUI, Plots, Combinatorics
plotly()
theme(:solarized_light)
end
Loading, cleaning, and manipulating the data
Column names: sepal_len, sepal_wid, petal_len, petal_wid, class
4 rows × 8 columns
| | variable | mean | min | median | max | nunique | nmissing | eltype |
|---|---|---|---|---|---|---|---|---|
| | Symbol | Float64 | Float64 | Float64 | Float64 | Nothing | Nothing | DataType |
| 1 | sepal_len | 5.84333 | 4.3 | 5.8 | 7.9 | | | Float64 |
| 2 | sepal_wid | 3.054 | 2.0 | 3.0 | 4.4 | | | Float64 |
| 3 | petal_len | 3.75867 | 1.0 | 4.35 | 6.9 | | | Float64 |
| 4 | petal_wid | 1.19867 | 0.1 | 1.3 | 2.5 | | | Float64 |
begin
path = "iris/iris.data"
iris_names = ["sepal_len", "sepal_wid", "petal_len", "petal_wid", "class"]
# The file has no header row, so the column names are supplied explicitly
df = CSV.read(path, DataFrame; header=iris_names)
dropmissing!(df)
md"""
**Column names:** $(join(iris_names, ", "))
$(describe(df, cols=1:4))
"""
end
Splitting the data into three iris classes
As you can see, each class is equally represented:
Class sizes: (50, 5), (50, 5), (50, 5)
begin
df_species = groupby(df, :class)
md"""**Class sizes:** $(size(df_species[1])), $(size(df_species[2])), $(size(df_species[3]))"""
end
Visualizations
Comparing length vs width of the sepal and petal
begin
scatter(title="len vs wid", xlabel = "length", ylabel="width",
df.sepal_len, df.sepal_wid, color="blue", label="sepal")
scatter!(df.petal_len, df.petal_wid, color="red", label="petal")
end
Comparing all combinations of variables
Column pairs per chart: [sepal_len, sepal_wid, petal_len, petal_wid, class]
-> [1, 2] , [1, 3] , [1, 4]
-> [2, 3] , [2, 4] , [3, 4]
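The six column pairs listed above are the 2-element combinations of the four feature columns. As a minimal sketch, assuming nothing beyond Base Julia, they can be enumerated without the Combinatorics package (the names `feature_names` and `col_pairs` are illustrative and not part of the notebook):

```julia
# Feature column names as used in the notebook.
feature_names = ["sepal_len", "sepal_wid", "petal_len", "petal_wid"]

# All unordered pairs (i, j) with i < j: the 2-combinations of 1:4.
col_pairs = [(i, j) for i in 1:4 for j in (i + 1):4]

# Print one line per chart, e.g. "sepal_len vs sepal_wid".
for (i, j) in col_pairs
    println(feature_names[i], " vs ", feature_names[j])
end
```

The number of pairs is `binomial(4, 2)`, which is why the charts fit a 2×3 layout.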
begin
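# Get all pairs of feature columns
combins = collect(combinations(1:4,2))
# One scatter plot per pair, with the column names as axis labels
plots = [scatter(df[:, i], df[:, j], xlabel=names(df)[i], ylabel=names(df)[j], legend=false) for (i, j) in combins]
# Arrange all pairs as sub-plots
plot(plots..., layout=(2,3))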
end
Comparing the sepal length vs sepal width vs petal length of all three classes of iris
Restricted to three variables so the data can be plotted in 3D
begin
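setosa, versicolor, virginica = df_species
# Columns 1 to 3 are sepal_len, sepal_wid, petal_len
scatter(setosa[:, 1], setosa[:, 2], setosa[:, 3], label="setosa",
xlabel="sepal_len", ylabel="sepal_wid", zlabel="petal_len")
scatter!(versicolor[:, 1], versicolor[:, 2], versicolor[:, 3], label="versicolor")
scatter!(virginica[:, 1], virginica[:, 2], virginica[:, 3], label="virginica")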
end
[3] Deep Learning
Imports
begin
Pkg.add("Flux")
Pkg.add("CUDA")
Pkg.add("IterTools")
using Flux
using Flux: Data.DataLoader
using Flux: @epochs
using CUDA
using Random
using IterTools: ncycle
Random.seed!(123);
# CUDA.allowscalar(false)
end
The Data
Formatting data for training (including one-hot conversion; the move to the GPU is commented out)
begin
# Convert df to a matrix (one row per sample)
data = Matrix(df)
# Shuffle the rows
data = data[shuffle(1:end), :]
# train/test split: the first 70% of the shuffled rows are used for training
train_test_ratio = .7
idx = floor(Int, size(df, 1) * train_test_ratio)
data_train = data[1:idx,:]
data_test = data[idx+1:end, :]
# Get feature vectors
get_feat(d) = transpose(convert(Array{Float32},d[:, 1:end-1]))
x_train = get_feat(data_train)
x_test = get_feat(data_test)
# One hot labels
# onehot(d) = [Flux.onehot(v, unique(df.class)) for v in d[:,end]]
onehot(d) = Flux.onehotbatch(d[:,end], unique(df.class))
y_train = onehot(data_train)
y_test = onehot(data_test)
# Push data onto the GPU
# x_train = cu(x_train)
# x_test = cu(x_test)
# y_train = cu(y_train)
# y_test = cu(y_test)
md"""
Formatting data for training (including one-hot conversion; the move to the GPU is commented out)
"""
end
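The two steps in the cell above, the 70/30 split and the one-hot conversion, can be illustrated without Flux. The following is only a sketch in Base Julia plus the `Random` standard library; the four-element `labels` vector is a hypothetical stand-in for the real class column, and `onehot_matrix`, `train_rows`, and `test_rows` are illustrative names that do not appear in the notebook:

```julia
using Random

# Hypothetical stand-in for the class column (the real one has 150 entries).
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-versicolor"]
classes = unique(labels)

# One row per class, one column per sample; exactly one `true` per column.
onehot_matrix = [c == l for c in classes, l in labels]

# 70/30 split of 150 shuffled row indices, mirroring the ratio used above.
n = 150
idx = floor(Int, n * 0.7)                   # number of training rows
perm = randperm(MersenneTwister(123), n)    # shuffled row indices
train_rows, test_rows = perm[1:idx], perm[idx+1:end]

println(size(onehot_matrix), " ", length(train_rows), " ", length(test_rows))
```

With 150 samples this leaves 105 rows for training and 45 for testing.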
Creating DataLoaders for batches
begin
batch_size= 1
train_dl = DataLoader((x_train, y_train), batchsize=batch_size, shuffle=true)
test_dl = DataLoader((x_test, y_test), batchsize=batch_size)
md"""#### Creating DataLoaders for batches"""
end
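To make the effect of the batch size concrete, here is a rough sketch of how many batches, and therefore how many parameter updates, the settings in this notebook imply. It uses `Iterators.partition` from Base Julia as a stand-in for the `DataLoader`; the variable names are illustrative, and the 105 training samples follow from the 70% split of 150 rows:

```julia
n_train = 105      # 70% of the 150 samples
batch_size = 1
epochs = 30

# Split the sample indices into consecutive batches of `batch_size`.
batches = collect(Iterators.partition(1:n_train, batch_size))

# One parameter update (and one callback call) per batch, per epoch.
steps = epochs * length(batches)
println(length(batches), " batches per epoch, ", steps, " updates in total")
```

With a batch size of 1 every sample is its own batch, so the number of updates equals epochs × data size.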
The model
I am going to implement a fully connected neural network to classify by species.
Layers: Chain(Dense(4, 8, relu), Dense(8, 3), softmax)
Loss: logit binary crossentropy
Optimizer: Flux.Optimise.ADAM
Learning rate: 0.001
Epochs: 30
Batch size: 1
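As a rough illustration of what this architecture computes, the sketch below writes the forward pass of a 4-8-3 network by hand in Base Julia. It is not the Flux implementation: the weights are hypothetical random values, the input is an illustrative measurement vector, and the helper names (`relu`, `softmax`, `forward`, `crossentropy`) are defined locally. The loss shown is the categorical cross-entropy that is commonly paired with a softmax output, which is a different function from the logit binary crossentropy listed above (Flux's `logitbinarycrossentropy` is designed to receive raw logits rather than softmax outputs).

```julia
using Random

relu(z) = max(z, zero(z))

# Numerically stable softmax: subtract the maximum before exponentiating.
softmax(z) = (e = exp.(z .- maximum(z)); e ./ sum(e))

# Hypothetical, randomly initialized parameters for Dense(4, 8) and Dense(8, 3).
rng = MersenneTwister(123)
W1, b1 = randn(rng, Float32, 8, 4), zeros(Float32, 8)
W2, b2 = randn(rng, Float32, 3, 8), zeros(Float32, 3)

forward(x) = softmax(W2 * relu.(W1 * x .+ b1) .+ b2)

# Categorical cross-entropy between class probabilities p and a one-hot target y.
crossentropy(p, y) = -sum(y .* log.(p))

# Trainable parameters: 4*8 + 8 + 8*3 + 3 = 67.
n_params = length(W1) + length(b1) + length(W2) + length(b2)

x = Float32[5.1, 3.5, 1.4, 0.2]   # illustrative sample: sepal_len, sepal_wid, petal_len, petal_wid
p = forward(x)                    # three class probabilities that sum to 1
println(n_params, " parameters, probabilities: ", p)
```

The parameter count of 67 is small next to the 105 training samples, which fits the choice of a shallow network for this data set.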
Training!
Train
acc: 0.9619047619047619
loss: 0.6096755f0
Test
acc: 1.0
loss: 0.6018428f0
begin
### Model ------------------------------
function get_model()
c = Chain(
Dense(4,8,relu),
Dense(8,3),
softmax
)
# c = cu(c)
end
model = get_model()
### Loss ------------------------------
loss(x,y) = Flux.Losses.logitbinarycrossentropy(model(x), y)
train_losses = []
test_losses = []
train_acces = []
test_acces = []
### Optimiser ------------------------------
lr = 0.001
opt = ADAM(lr, (0.9, 0.999))
### Callbacks ------------------------------
function loss_all(data_loader)
sum([loss(x, y) for (x,y) in data_loader]) / length(data_loader)
end
function acc(data_loader)
f(x) = Flux.onecold(cpu(x))
acces = [sum(f(model(x)) .== f(y)) / size(x,2) for (x,y) in data_loader]
sum(acces) / length(data_loader)
end
callbacks = [
() -> push!(train_losses, loss_all(train_dl)),
() -> push!(test_losses, loss_all(test_dl)),
() -> push!(train_acces, acc(train_dl)),
() -> push!(test_acces, acc(test_dl)),
]
# Training ------------------------------
epochs = 30
ps = Flux.params(model)
Flux.@epochs epochs Flux.train!(loss, ps, train_dl, opt, cb = callbacks)
train_loss = loss_all(train_dl)
test_loss = loss_all(test_dl)
train_acc = acc(train_dl)
test_acc = acc(test_dl)
md"""
### Training!
**Train**
acc: $(train_acc)
loss: $(train_loss)
**Test**
acc: $(test_acc)
loss: $(test_loss)
"""
end
Results
begin
x_axis = 1:epochs * size(y_train,2)
plot(x_axis, train_losses, label="Training loss",
title="Loss", xaxis="epochs * data size")
plot!(x_axis, test_losses, label="Testing loss")
end
begin
plot(x_axis, train_acces, label="Training acc",
title="Accuracy", xaxis="epochs * data size")
plot!(x_axis, test_acces, label="Testing acc")
end
One example prediction:
Prediction: 0.0074335323 , 0.8525481 , 0.14001828
Truth: 0 , 1 , 0
error: 0.2949037f0
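The cell that produced this example is not shown, so how the error was computed is an inference: the reported value is consistent with the sum of absolute differences between the prediction and the one-hot truth. A small sketch that reproduces it under that assumption (the variable names are illustrative):

```julia
# Values copied from the example prediction above.
pred  = Float32[0.0074335323, 0.8525481, 0.14001828]
truth = Float32[0, 1, 0]

# Assumed error measure: sum of absolute differences (L1 distance).
err = sum(abs.(pred .- truth))

# The predicted class is the index of the largest probability.
predicted_class = argmax(pred)
println("error = ", err, ", predicted class = ", predicted_class)
```

The largest probability sits at index 2, which matches the position of the 1 in the truth vector, so this sample is classified correctly.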
Confusion matrix
begin
# Predicted and true class indices, in the order of unique(df.class).
# onecold picks the most probable class, so every sample gets a prediction
# even when no probability exceeds 0.5.
preds = Flux.onecold(cpu(model(x_test)))
truths = Flux.onecold(cpu(y_test))
# Rows index the true class, columns the predicted class
conf_mat = zeros(3,3)
for (y′, y) in zip(preds, truths)
conf_mat[y, y′] += 1
end
# conf_mat = conf_mat ./ sum(conf_mat) # normalize
label = "setosa \t:\t versicolor \t:\t virginica"
heatmap(conf_mat, color=:plasma, aspect_ratio=1, xaxis=label, axis = nothing)
end
[4] Conclusion
Platform/Tools
I chose to implement a basic feed-forward neural network because of the scale of the problem. With so few samples and so few features per sample, a small network is a better fit than a large one. For the same reason, shallow ML approaches would probably have been sufficient. Comparing the network against such methods is a natural way to extend this research.
I wanted to challenge myself and learn an entirely new language and platform for this project. The Julia Programming Language is a high-level, dynamically typed language. I worked in Pluto, a web-based notebook editor for Julia that is much like Python's Jupyter notebooks. Because Julia is newer than Python and its community is smaller, far less documentation and support are available. This slowed me down considerably. Despite the setbacks, I learned a lot in this research and I am glad I decided to use Julia.
Results
My model's test accuracy was 95.55%. This is satisfactory given the simplicity of the data set and the model. While one species is linearly separable from the others, the remaining two are not separable from each other. Telling these latter two species apart is the main problem the model has to tackle.
As I stated at the beginning of this paper, this model could be used for classification tasks such as automation, or as a tool that helps biology researchers with identification. Furthermore, this model could serve as a pre-trained model for more specific tasks. I understand this is a bit of a stretch, but I want to account for as many applications as possible.
[5] Related work
Related research: Kaggle
One thing they did that I did not is compare their deep learning model to more classical approaches.