The Iris data set

https://archive.ics.uci.edu/ml/datasets/iris

[1] Data set description and possible applications

This data set contains 150 samples of iris flowers. The features in each sample are the length and width of both the petal and the sepal, plus the species of iris, so the data forms a 150×5 table.

Each feature is recorded as a floating-point value except for the species, which is a string. The species identifier acts as the label for this data set when it is used for supervised learning. There are no missing values. The data and the header are separated into two different files.

This data could be used for iris classification, which could be useful in an automation task involving these flowers or as a tool for researchers to assist in quick identification. Other, less "real world" applications include use as a benchmark data set for ML systems, both supervised (neural networks, k-NN) and unsupervised (clustering methods such as k-means).

[2] Data summary and visualizations

Imports

begin
    import Pkg;
    packages = ["CSV","DataFrames","PlutoUI","Plots","Combinatorics"]   
    Pkg.add(packages)
    
    using CSV, DataFrames, PlutoUI, Plots, Combinatorics

    plotly()
    theme(:solarized_light)
end

Loading, cleaning, and manipulating the data

Column names: sepal_len, sepal_wid, petal_len, petal_wid, class

4 rows × 8 columns

     variable   mean     min      median   max      nunique  nmissing  eltype
     Symbol     Float64  Float64  Float64  Float64  Nothing  Nothing   DataType
1    sepal_len  5.84333  4.3      5.8      7.9                         Float64
2    sepal_wid  3.054    2.0      3.0      4.4                         Float64
3    petal_len  3.75867  1.0      4.35     6.9                         Float64
4    petal_wid  1.19867  0.1      1.3      2.5                         Float64

begin
    path = "iris/iris.data"
    iris_names = ["sepal_len", "sepal_wid", "petal_len", "petal_wid", "class"]

    # The file has no header row, so supply the column names directly
    df = DataFrame(CSV.File(path; header=iris_names))
    # Drop any rows that contain missing values
    dropmissing!(df)

    md"""
    **Column names:** $(join(iris_names, ", "))
    $(describe(df, cols=1:4))
    """
end

Splitting the data into three iris classes

As you can see, there is an equal representation of each class:

Class sizes: (50, 5), (50, 5), (50, 5)

begin
    df_species = groupby(df, :class)
    md"""**Class sizes:** $(size(df_species[1])), $(size(df_species[2])), $(size(df_species[3]))"""
end
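
The same balance check can be sketched without DataFrames by tallying labels in a dictionary. The `labels` vector and the `class_counts` helper below are stand-ins built for illustration; they are not the notebook's `df.class` column or any library function.

```julia
# Stand-in label vector: 50 of each species, mirroring the class sizes above
labels = repeat(["Iris-setosa", "Iris-versicolor", "Iris-virginica"], inner=50)

# Hypothetical helper: tally how many samples carry each label
function class_counts(labels)
    counts = Dict{String,Int}()
    for l in labels
        counts[l] = get(counts, l, 0) + 1
    end
    return counts
end

counts = class_counts(labels)
```

A balanced data set shows the same count for every key, which is what makes plain accuracy a reasonable metic-free starting point for comparing classes later on.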

Visualizations

Comparing length vs width of the sepal and petal

[Figure: scatter plot of length vs width, with sepal measurements in blue and petal measurements in red]

begin
    # Sepal measurements in blue, petal measurements in red
    scatter(df.sepal_len, df.sepal_wid; color="blue", label="sepal",
            title="len vs wid", xlabel="length", ylabel="width")
    scatter!(df.petal_len, df.petal_wid; color="red", label="petal")
end

Comparing all combinations of variables

Column pairs per chart: [sepal_len, sepal_wid, petal_len, petal_wid, class]

-> [1, 2] , [1, 3] , [1, 4]

-> [2, 3] , [2, 4] , [3, 4]
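
The six pairs listed above are all the ways to pick two of the four feature columns. As a minimal sketch that avoids the Combinatorics package, they can be enumerated with a nested comprehension; `col_pairs` and `pair_names` are names introduced here for illustration.

```julia
# Column names as used in the notebook; the last one is the label
iris_names = ["sepal_len", "sepal_wid", "petal_len", "petal_wid", "class"]

# All unordered pairs (i, j) with i < j drawn from the four feature columns
col_pairs = [(i, j) for i in 1:4 for j in (i + 1):4]

# Readable axis labels for each pair, e.g. ("sepal_len", "sepal_wid")
pair_names = [(iris_names[i], iris_names[j]) for (i, j) in col_pairs]
```

The enumeration order matches the listing above, so the k-th pair corresponds to the k-th subplot when the charts are laid out row by row.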

[Figure: 2×3 grid of scatter plots, one per pair of feature columns]

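begin
    # Get all pairs of feature columns
    combins = collect(combinations(1:4, 2))
    # One scatter plot per pair, labelled with the column names
    subplots = [scatter(df[!, i], df[!, j]; xlabel=iris_names[i],
                        ylabel=iris_names[j], legend=false) for (i, j) in combins]
    # Arrange the six plots in a 2×3 grid
    plot(subplots...; layout=(2, 3))
end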

Comparing the sepal length vs sepal width vs petal length of all three classes of iris

The plot is restricted to three variables so that it can be drawn in 3D.

[Figure: 3D scatter plot of sepal length, sepal width, and petal length, one series per species]

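begin
    # Assumes the groups are ordered setosa, versicolor, virginica, as in the file
    setosa, versicolor, virginica = df_species

    scatter(setosa.sepal_len, setosa.sepal_wid, setosa.petal_len, label="setosa",
            xlabel="sepal length", ylabel="sepal width", zlabel="petal length")
    scatter!(versicolor.sepal_len, versicolor.sepal_wid, versicolor.petal_len, label="versicolor")
    scatter!(virginica.sepal_len, virginica.sepal_wid, virginica.petal_len, label="virginica")
end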

[3] Deep Learning

Imports

begin
    Pkg.add("Flux")
    Pkg.add("CUDA")
    Pkg.add("IterTools")

    using Flux
    using Flux.Data: DataLoader
    using Flux: @epochs
    using CUDA
    using Random
    using IterTools: ncycle

    # Fix the seed so the shuffle and the weight initialisation are reproducible
    Random.seed!(123);

#   CUDA.allowscalar(false)
end

The Data

Formatting data for training (including one-hot conversion; the data is NOT moved to the GPU)

begin
    # Convert df to a matrix (one row per sample)
    data = Matrix(df)

    # Shuffle the rows
    data = data[shuffle(1:end), :]

    # train/test split
    train_test_ratio = 0.7
    idx = floor(Int, size(df, 1) * train_test_ratio)
    data_train = data[1:idx, :]
    data_test = data[idx+1:end, :]

    # Get feature matrices with one column per sample
    get_feat(d) = permutedims(Float32.(d[:, 1:end-1]))
    x_train = get_feat(data_train)
    x_test = get_feat(data_test)

    # One-hot labels (one column per sample)
    #   onehot(d) = [Flux.onehot(v, unique(df.class)) for v in d[:,end]]
    onehot(d) = Flux.onehotbatch(d[:, end], unique(df.class))
    y_train = onehot(data_train)
    y_test = onehot(data_test)

    # Push data onto the GPU (disabled)
#   x_train = cu(x_train)
#   x_test = cu(x_test)
#   y_train = cu(y_train)
#   y_test = cu(y_test)

    md"""
    Formatting data for training (including one-hot conversion; the data is NOT moved to the GPU)
    """
end
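
To make the two key steps of the cell above concrete, here is a minimal sketch of an index-based train/test split and a one-hot encoding written with the standard library only. `split_indices` and `onehot_matrix` are hypothetical helpers named for this example, and the labels `"a"`, `"b"`, `"c"` are placeholders; the notebook itself relies on `Flux.onehotbatch`.

```julia
using Random

# Hypothetical helper: shuffle the sample indices and cut them at the given ratio
function split_indices(n, ratio; rng=Random.default_rng())
    perm = shuffle(rng, 1:n)
    cut = floor(Int, n * ratio)
    return perm[1:cut], perm[cut+1:end]
end

# Hypothetical helper: one-hot encode labels as a (classes × samples) Bool matrix
function onehot_matrix(labels, classes)
    return [l == c for c in classes, l in labels]
end

train_idx, test_idx = split_indices(150, 0.7; rng=MersenneTwister(123))
y = onehot_matrix(["a", "c", "b"], ["a", "b", "c"])
```

With 150 samples and a ratio of 0.7 this gives 105 training and 45 testing indices. Each column of `y` contains exactly one `true`, in the row of the matching class.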

Creating DataLoaders for batches

begin
    batch_size= 1
    train_dl = DataLoader((x_train, y_train), batchsize=batch_size, shuffle=true)
    test_dl = DataLoader((x_test, y_test), batchsize=batch_size)
    
    md"""#### Creating DataLoaders for batches"""
end
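
Conceptually, a data loader slices the sample dimension (the columns here) into consecutive batches, optionally after shuffling. The sketch below illustrates that idea with the standard library; `minibatches` is a hypothetical helper, not Flux's `DataLoader`, and the arrays are random placeholders.

```julia
using Random

# Hypothetical helper: return (x, y) mini-batches, where columns are samples
function minibatches(x, y; batchsize=1, shuffle_data=false, rng=Random.default_rng())
    n = size(x, 2)
    order = shuffle_data ? shuffle(rng, 1:n) : collect(1:n)
    # Iterators.partition yields index chunks of at most `batchsize` elements
    return [(x[:, idx], y[:, idx]) for idx in Iterators.partition(order, batchsize)]
end

x = rand(Float32, 4, 10)   # 10 made-up samples with 4 features each
y = rand(Bool, 3, 10)      # made-up label matrix of matching width
batches = minibatches(x, y; batchsize=4)
```

With a batch size of 1, as in the cell above, every training step sees a single sample, so one epoch over 105 training samples performs 105 parameter updates.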

The model

I am going to implement a fully connected neural network to classify by species.

Layers: Chain(Dense(4, 8, relu), Dense(8, 3), softmax)

Loss: logit binary crossentropy

Optimizer: Flux.Optimise.ADAM

Learning rate: 0.001

Epochs: 30

Batch size: 1
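
To show what the listed layers compute, here is a rough sketch of the forward pass in plain Julia. The weights are random stand-ins with the same shapes as `Dense(4, 8, relu)` and `Dense(8, 3)`, and `relu`, `softmax`, `forward`, and `crossentropy` are defined locally for illustration rather than taken from Flux. The loss shown is categorical cross-entropy, which is the loss that conventionally pairs with a softmax output; the notebook itself uses `logitbinarycrossentropy`.

```julia
# ReLU and a numerically stable softmax over a vector of scores
relu(z) = max.(z, 0)
function softmax(z)
    e = exp.(z .- maximum(z))
    return e ./ sum(e)
end

# Randomly initialised stand-in weights with the layer shapes listed above
W1, b1 = randn(Float32, 8, 4), zeros(Float32, 8)   # Dense(4, 8, relu)
W2, b2 = randn(Float32, 3, 8), zeros(Float32, 3)   # Dense(8, 3)

# Hidden layer with ReLU, output layer, then softmax to get class probabilities
forward(x) = softmax(W2 * relu(W1 * x .+ b1) .+ b2)

# Categorical cross-entropy between predicted probabilities and a one-hot target
crossentropy(p, y) = -sum(y .* log.(p .+ eps(Float32)))

x = Float32[5.1, 3.5, 1.4, 0.2]   # illustrative sample: sepal_len, sepal_wid, petal_len, petal_wid
p = forward(x)
n_params = length(W1) + length(b1) + length(W2) + length(b2)
```

The network has 4·8 + 8 + 8·3 + 3 = 67 trainable parameters, which is small relative to the 105 training samples.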

Training!

Train

acc: 0.9619047619047619

loss: 0.6096755f0

Test

acc: 1.0

loss: 0.6018428f0

begin
    ### Model ------------------------------
    function get_model()
        c = Chain(
            Dense(4,8,relu),
            Dense(8,3),
            softmax
        )
#       c = cu(c)
    end
    
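    model = get_model()

    ### Loss ------------------------------
    # Note: logitbinarycrossentropy expects raw logits, but the model ends in softmax;
    # Flux.Losses.crossentropy is the loss that matches softmax probabilities.
    loss(x,y) = Flux.Losses.logitbinarycrossentropy(model(x), y)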
    
    train_losses = []
    test_losses = []
    train_acces = []
    test_acces = []
    
    ### Optimiser ------------------------------
    lr = 0.001
    opt = ADAM(lr, (0.9, 0.999))

    ### Callbacks ------------------------------
    function loss_all(data_loader)
        sum([loss(x, y) for (x,y) in data_loader]) / length(data_loader) 
    end
    
    function acc(data_loader)
        f(x) = Flux.onecold(cpu(x))
        acces = [sum(f(model(x)) .== f(y)) / size(x,2)  for (x,y) in data_loader]
        sum(acces) / length(data_loader)
    end
    
    callbacks = [
        () -> push!(train_losses, loss_all(train_dl)),
        () -> push!(test_losses, loss_all(test_dl)),
        () -> push!(train_acces, acc(train_dl)),
        () -> push!(test_acces, acc(test_dl)),
    ]

    # Training ------------------------------
    epochs = 30
    ps = Flux.params(model)
    
    @epochs epochs Flux.train!(loss, ps, train_dl, opt, cb = callbacks)
    
    @show train_loss = loss_all(train_dl)
    @show test_loss = loss_all(test_dl)
    @show train_acc = acc(train_dl)
    @show test_acc = acc(test_dl)
    
    md"""
    ### Training!
    **Train**   
      
      acc: $(train_acc)
      
      loss: $(train_loss)
    
    **Test**
    
      acc: $(test_acc)
      
      loss: $(test_loss)
    """
end

Results

[Figure: training and testing loss per training step]

begin
    x_axis = 1:epochs * size(y_train,2)
    plot(x_axis, train_losses, label="Training loss",
        title="Loss", xaxis="epochs * data size")
    plot!(x_axis, test_losses, label="Testing loss")
end

[Figure: training and testing accuracy per training step]

begin
    plot(x_axis, train_acces, label="Training acc",
        title="Accuracy", xaxis="epochs * data size")
    plot!(x_axis, test_acces, label="Testing acc")
end

One example prediction:

Prediction: 0.0074335323 , 0.8525481 , 0.14001828

Truth: 0 , 1 , 0

error: 0.2949037f0
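
The numbers above can be checked by hand. The predicted class is the index of the largest probability, and the reported error is consistent with the sum of absolute differences between prediction and truth; that the notebook's hidden cell computes it this way is an assumption.

```julia
pred  = Float32[0.0074335323, 0.8525481, 0.14001828]   # probabilities shown above
truth = Float32[0, 1, 0]                                # one-hot target shown above

predicted_class = argmax(pred)        # index of the largest probability
true_class = argmax(truth)            # index of the single 1 in the target
l1_error = sum(abs.(pred .- truth))   # sum of absolute differences
```

Both indices point at the second class, so this example is classified correctly even though the network assigns about 14% of the probability mass to the third class.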

Confusion matrix

[Figure: heatmap of the 3×3 confusion matrix]

begin
    # Class indices (1 = setosa, 2 = versicolor, 3 = virginica, following unique(df.class)).
    # onecold picks the most probable class, which also covers predictions where no
    # probability reaches 0.5 and rounding would yield an all-zero column.
    preds = Flux.onecold(model(x_test))
    truths = Flux.onecold(y_test)

    # Rows index the true class, columns the predicted class
    conf_mat = zeros(Int, 3, 3)
    for (y′, y) in zip(preds, truths)
        conf_mat[y, y′] += 1
    end

#   conf_mat = conf_mat ./ sum(conf_mat) # normalize
    label = "setosa \t:\t versicolor \t:\t virginica"
    heatmap(conf_mat, color=:plasma, aspect_ratio=1, xaxis=label, axis = nothing)

end
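
A confusion matrix also yields per-class metrics directly. The sketch below builds one from integer class labels and derives overall accuracy and per-class recall. `confusion_matrix` is a hypothetical helper, and the label vectors are made up for illustration; they are not the notebook's results.

```julia
# Hypothetical helper: rows index the true class, columns the predicted class
function confusion_matrix(truths, preds, n_classes)
    cm = zeros(Int, n_classes, n_classes)
    for (t, p) in zip(truths, preds)
        cm[t, p] += 1
    end
    return cm
end

# Made-up labels for illustration only
truths = [1, 1, 2, 2, 2, 3, 3, 3]
preds  = [1, 1, 2, 3, 2, 3, 3, 2]

cm = confusion_matrix(truths, preds, 3)
accuracy = sum(cm[i, i] for i in 1:3) / sum(cm)    # correct predictions lie on the diagonal
recall = [cm[i, i] / sum(cm[i, :]) for i in 1:3]   # per-class recall, one entry per true class
```

Off-diagonal entries show which classes are confused with each other, which is the detail a single accuracy number hides.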

[4] Conclusion

Platform/Tools

I chose to implement a basic feed-forward neural network because of the scale of the problem. With the data set containing so few samples and very few features, a small network is a better fit. For the same reason, shallow ML approaches would likely have been sufficient; comparing against such methods is a natural way to expand on this research.

I wanted to challenge myself and learn an entirely new language and platform for this project. The Julia Programming Language is a high-level, dynamically typed language. Its Pluto.jl package provides a web-based notebook editor that is much like Python's Jupyter notebooks. Because Julia is newer and its community is smaller than Python's, the available documentation and support are far less extensive. This slowed me down considerably. Despite the setbacks, I learned a lot in this research and I am glad I decided to use Julia.

Results

My model's test accuracy was 95.55%. I consider this satisfactory given the simplicity of the data set and the model. While one species was linearly separable, the other two were not. Distinguishing these latter two species is the main problem for the model to tackle.

As I stated at the beginning of this paper, this model could be used for classification tasks such as automation or as a tool for biology researchers to aid in identification. Furthermore, this model could be used as a pre-trained model for more specific tasks; I understand this statement is a bit of a stretch, but I want to account for as many applications as possible.

[5] Related work

Related research: Kaggle

One thing they did, that I didn't do, is compare their deep learning model to more classical approaches.
