githublinkedinemail
← /articles

You don't understand Machine Learning until you implement it from scratch

Calling sklearn.fit() isn't understanding. Building a perceptron, logistic regression and a net that solves XOR in Ruby is. ML is algebra and optimization — not magic.

5 minml

You don't understand Machine Learning until you implement it from scratch

There's a huge difference between:

from sklearn.linear_model import LogisticRegression
model = LogisticRegression().fit(X, y)

…and knowing what those two lines are doing.

The first is usage. The second is understanding.

And the whole industry confuses the two.


The fit() illusion

You imported. You called .fit(). The model trained. 92% accuracy.

Did you "do ML"?

No. You used the interface.

┌──────────────────────────────────────┐
│   model.fit(X, y)                    │
│                                      │
│   what you see: 1 line               │
│   what's inside: 5000 lines          │
└──────────────────────────────────────┘

Inside that line:

  • weight initialization (with what strategy?)
  • choice of activation function
  • forward pass
  • loss computation
  • backpropagation
  • weight update with some optimizer
  • stopping criterion
  • regularization

You chose none of that. The default chose for you.

Then comes the day the model doesn't converge. And you don't even know where to start.


My experience: an ML lib in Ruby

I decided to learn ML the way I learned everything in programming: by writing, not by reading.

I didn't care if it would be slow. I didn't want a gem. I wanted to see the algorithm bleed.

In Ruby. Plain arrays. No NMatrix, no Numo, nothing.

The path was this:

1. Perceptron           → got "neuron" and linear boundary
2. Logistic regression  → got probability and sigmoid
3. 1-hidden-layer net   → got backprop
4. XOR                  → got why a net needs non-linearity

Each step broke a different illusion.


Step 1 — Perceptron

A perceptron is a single neuron.

     x1 ──w1──┐
              ├──[ Σ ]──[ step ]── y
     x2 ──w2──┘
              ↑
              b

Weights w, bias b, weighted sum, step function on the output.

class Perceptron
  def initialize(n_inputs, lr: 0.1)
    @w = Array.new(n_inputs) { rand(-0.5..0.5) }
    @b = 0.0
    @lr = lr
  end

  def predict(x)
    z = x.zip(@w).sum { |xi, wi| xi * wi } + @b
    z >= 0 ? 1 : 0
  end

  def train(xs, ys, epochs: 20)
    epochs.times do
      xs.zip(ys).each do |x, y|
        pred = predict(x)
        error = y - pred
        @w = @w.zip(x).map { |wi, xi| wi + @lr * error * xi }
        @b += @lr * error
      end
    end
  end
end

# logical AND
xs = [[0,0],[0,1],[1,0],[1,1]]
ys = [0, 0, 0, 1]

p = Perceptron.new(2)
p.train(xs, ys)
xs.each { |x| puts "#{x.inspect} → #{p.predict(x)}" }

Works for AND. For OR. For any linearly separable problem.

Then came the slap.


Step 2 — XOR breaks everything

I tried training for XOR:

xs = [[0,0],[0,1],[1,0],[1,1]]
ys = [0, 1, 1, 0]

Doesn't converge. Ever.

And that's when I understood, for the first time, what linear separability means.

Plot XOR on a plane:

   x2
    │
  1 ●   ○      ● = class 1
    │           ○ = class 0
    │
  0 ○   ●
    │
    └────────── x1
        0   1

There's no straight line separating the ●s from the ○s. None.

A perceptron is a line. So a perceptron can't solve XOR.

This is the problem that killed the AI hype of the 70s.

And until I implemented it, it was just a textbook curiosity.

After implementing it, it became knowledge.


Step 3 — Logistic regression

Swapping step for sigmoid solves something important: the output becomes a probability.

sigmoid(z) = 1 / (1 + e^-z)

  1.0 │            ___________
      │         __/
  0.5 │      __/
      │   __/
  0.0 │__/
      └──────────────────────── z

And since sigmoid is differentiable, you can actually use gradient descent.

class LogReg
  def initialize(n_inputs, lr: 0.1)
    @w = Array.new(n_inputs, 0.0)
    @b = 0.0
    @lr = lr
  end

  def sigmoid(z) = 1.0 / (1.0 + Math.exp(-z))

  def predict_proba(x)
    z = x.zip(@w).sum { |xi, wi| xi * wi } + @b
    sigmoid(z)
  end

  def train(xs, ys, epochs: 1000)
    n = xs.length.to_f
    epochs.times do
      dw = Array.new(@w.length, 0.0)
      db = 0.0
      xs.zip(ys).each do |x, y|
        p = predict_proba(x)
        err = p - y
        @w.each_index { |i| dw[i] += err * x[i] }
        db += err
      end
      @w.each_index { |i| @w[i] -= @lr * dw[i] / n }
      @b -= @lr * db / n
    end
  end
end

Still linear. Still doesn't solve XOR. But now you have probability, gradient descent, and a differentiable loss.

One thing is missing: depth.


Step 4 — One hidden layer solves XOR

Add a layer of neurons in the middle:

   x1 ─┐    ┌──[ h1 ]──┐
       ├────┤          ├──[ y ]
   x2 ─┘    └──[ h2 ]──┘

Each h is a non-linear transformation of the input. The output combines the hs.

Now the net can draw two lines internally and combine them. XOR becomes separable.

class TinyNet
  # 2 inputs → 2 hidden (sigmoid) → 1 output (sigmoid)
  def initialize(lr: 0.5)
    @w1 = Array.new(2) { Array.new(2) { rand(-1.0..1.0) } } # 2x2
    @b1 = Array.new(2, 0.0)
    @w2 = Array.new(2) { rand(-1.0..1.0) }                  # 2
    @b2 = 0.0
    @lr = lr
  end

  def sig(z) = 1.0 / (1.0 + Math.exp(-z))
  def dsig(s) = s * (1 - s)

  def forward(x)
    @h = Array.new(2) do |j|
      sig(x[0] * @w1[j][0] + x[1] * @w1[j][1] + @b1[j])
    end
    @y = sig(@h[0] * @w2[0] + @h[1] * @w2[1] + @b2)
  end

  def train(xs, ys, epochs: 10_000)
    epochs.times do
      xs.zip(ys).each do |x, target|
        # forward
        out = forward(x)

        # backward (chain rule)
        d_out = (out - target) * dsig(out)
        d_h = Array.new(2) { |j| d_out * @w2[j] * dsig(@h[j]) }

        # updates
        2.times { |j| @w2[j] -= @lr * d_out * @h[j] }
        @b2 -= @lr * d_out

        2.times do |j|
          @w1[j][0] -= @lr * d_h[j] * x[0]
          @w1[j][1] -= @lr * d_h[j] * x[1]
          @b1[j]    -= @lr * d_h[j]
        end
      end
    end
  end
end

xs = [[0,0],[0,1],[1,0],[1,1]]
ys = [0, 1, 1, 0]

net = TinyNet.new
net.train(xs, ys)
xs.each { |x| puts "#{x.inspect} → #{net.forward(x).round(3)}" }

Expected output:

[0, 0] → 0.04
[0, 1] → 0.96
[1, 0] → 0.96
[1, 1] → 0.05

Solved XOR. In Ruby. In 60 lines. With no gem at all.

And only then did I understand backprop. Because I wrote the chain rule by hand.


What fit() hides

Look at everything that showed up in this exercise:

┌─ forward pass ─────────────────────────┐
│  composition of non-linear functions   │
└────────────────────────────────────────┘
┌─ loss ─────────────────────────────────┐
│  how the error is measured             │
└────────────────────────────────────────┘
┌─ backprop ─────────────────────────────┐
│  chain rule across layers              │
└────────────────────────────────────────┘
┌─ weight init ──────────────────────────┐
│  zero breaks. small random works       │
└────────────────────────────────────────┘
┌─ activation ───────────────────────────┐
│  step, sigmoid, tanh, ReLU…            │
│  each one changes what the net can do  │
└────────────────────────────────────────┘
┌─ learning rate ────────────────────────┐
│  big α explodes, small α stalls        │
└────────────────────────────────────────┘

All of that disappears behind one sklearn line.

You'll never understand ML staring at that line.


Senior vs junior

  • junior imports a model, tunes hyperparameters until accuracy goes up, and moves to the next task
  • senior knows, for each line of fit(), which operation is happening and why

The junior treats the model as a black box.

The senior treats it as algebra with a loop.

Salary difference. And difference in the kind of problem each can actually solve.


Why this changes everything

When you implement it:

  • you understand why normalizing data matters (gradient explodes otherwise)
  • you understand why initialization matters (zero makes every neuron learn the same thing)
  • you understand why ReLU won (sigmoid saturates, gradient vanishes)
  • you understand why depth helps (more composition = more expressive power)
  • you understand why big nets need lots of data (more parameters = more degrees of freedom)

None of that lands by reading a blog post.

It lands by writing the code.


Conclusion

Machine Learning is not magic.

It's linear algebra, calculus, and optimization — orchestrated in a loop.

And you only see that when you write the loop with your own hands.

sklearn, PyTorch, TensorFlow are excellent tools — once you understand.

Before that, they're layers hiding what you needed to learn.

If you've never implemented a perceptron, a logistic regression, and a one-hidden-layer net, you don't know ML.

You know the API.

And APIs change. Fundamentals don't.

You don't understand Machine Learning until you implement it from scratch
You don't understand Machine Learning until you implement it from scratch