The Exciting Frontier of Custom KSQL Functions

Hi, I'm Mitch

  • Data Systems Engineer @Mailchimp
  • mitchseymour.com
  • new dad   ❤   Thai food, retrowave

Agenda


  • Motivation
  • Terminology / Basics
  • Remote services / models
  • Embedded models
  • Polyglot UDF experiment
  • Summary

Motivation

Why are custom KSQL functions important?

Why are custom KSQL functions exciting?

KSQL functions are shareable

They facilitate exploration of the current technological landscape

Let's explore

Agenda


  •   Motivation
  • Terminology / Basics
  • Remote services / models
  • Embedded models
  • Polyglot UDF experiment
  • Summary

Terminology

UDFs


  • User-defined functions
  • Operate on a single row
  • Stateless

UDAFs


  • User-defined aggregate functions
  • Multiple inputs, one output (aggregation)
  • Stateful

Example I

Basic functions


Concepts

  • Building
  • Deploying

The process of building custom KSQL functions is easy and repeatable

Maven Archetype

Start with the business logic

Add annotations
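
A minimal sketch of the two steps above, not the code shown in the talk. The `multiply` function and its description strings are hypothetical. So that the example compiles without the KSQL jars, it declares stand-in `@UdfDescription` and `@Udf` annotations; in a real project you would delete the stand-ins and import the annotations of the same names from `io.confluent.ksql.function.udf`.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Stand-ins (assumption) for the annotations in io.confluent.ksql.function.udf,
// declared here only so the sketch compiles on its own.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface UdfDescription {
    String name();
    String description();
}

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Udf {
    String description() default "";
}

// Step 1: plain business logic. Step 2: annotations that describe it to KSQL.
@UdfDescription(name = "multiply", description = "Multiplies two numbers")
class MultiplyUdf {

    @Udf(description = "Multiplies two BIGINT values")
    public long multiply(final long a, final long b) {
        return a * b;
    }

    // A second annotated method overloads the same function for DOUBLE inputs.
    @Udf(description = "Multiplies two DOUBLE values")
    public double multiply(final double a, final double b) {
        return a * b;
    }
}
```

Because the logic is an ordinary Java method, it can be unit tested before it is ever deployed to a KSQL server.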

Deploy

Verify

Invoke

What about UDAFs?
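
A rough sketch of the shape of a UDAF, assuming the initialize / aggregate / merge contract. The `Udaf` interface below is a local stand-in modelled on `io.confluent.ksql.function.udaf.Udaf` so the example compiles on its own; the exact methods vary by KSQL version, so check the interface you are building against. `TotalLengthUdaf` is a hypothetical example, not a function from the talk.

```java
// Stand-in (assumption) modelled on io.confluent.ksql.function.udaf.Udaf,
// declared locally so the sketch compiles without the KSQL jars.
interface Udaf<V, A> {
    A initialize();               // starting value of the aggregate
    A aggregate(V value, A agg);  // fold one more input row into the aggregate
    A merge(A left, A right);     // combine two partial aggregates
}

// Hypothetical UDAF: sums the lengths of all strings seen so far.
class TotalLengthUdaf implements Udaf<String, Long> {

    @Override
    public Long initialize() {
        return 0L;
    }

    @Override
    public Long aggregate(final String value, final Long agg) {
        // A null input contributes nothing to the total.
        return value == null ? agg : agg + value.length();
    }

    @Override
    public Long merge(final Long left, final Long right) {
        return left + right;
    }
}
```

The aggregate value passed between calls is what makes a UDAF stateful, in contrast to the single-row UDF.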

Build and Deploy

(same as before)

Agenda


  •   Motivation
  •   Terminology / Basics
  • Remote services / models
  • Embedded models
  • Polyglot UDF experiment
  • Summary

Example II

Sentiment Analysis


Concepts

  • Remote services
  • Third party dependencies

Sentiment Analysis


  • Product reception
  • Outage impact
  • Audience engagement
  • Abusive content moderation

Natural Language API

Configs vs Environment Variables
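
One way the two options could be combined, as a sketch only: read the API key from the function's configs when present, otherwise fall back to an environment variable. The config key and variable name are hypothetical. The `configure` method mirrors the signature of `org.apache.kafka.common.Configurable#configure`, which a real UDF class would implement; the `ksql.functions.<name>.` prefix used here is an assumption to verify against your KSQL version.

```java
import java.util.Map;

// Hypothetical helper that resolves an API key from function configs first
// and falls back to an environment variable.
class ApiKeyResolver {

    // Assumed names, for illustration only.
    static final String CONFIG_KEY = "ksql.functions.sentiment.api.key";
    static final String ENV_VAR = "SENTIMENT_API_KEY";

    private String apiKey;

    // Same signature as org.apache.kafka.common.Configurable#configure.
    public void configure(final Map<String, ?> configs) {
        final Object fromConfig = configs.get(CONFIG_KEY);
        apiKey = fromConfig != null ? fromConfig.toString() : System.getenv(ENV_VAR);
    }

    public String apiKey() {
        if (apiKey == null) {
            throw new IllegalStateException(
                "Set " + CONFIG_KEY + " or the " + ENV_VAR + " environment variable");
        }
        return apiKey;
    }
}
```

Failing loudly when neither source is set keeps a misconfigured function from silently calling the remote service without credentials.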

Maximizing throughput

Example III

Conversational interfaces


Concepts

  • Exceptions
  • Evolutionary UDFs

Dialogflow

"Organizations report a reduction of up to 70 percent in call, chat and/or email inquiries after implementing a VCA" - Gartner research

Use cases


  • Chat bots
  • Virtual assistants
  • Improved customer service

Example


input sourced from user

"I would like to book a room" - user123

response generated by Dialogflow via KSQL

"I can help with that. Where would you like to reserve a room?"

Hybrid training


  • Pre-trained ML models
  • User can also provide training data

How do we safely improve the model over time?

In event-driven architectures, this is easy

"By storing only the events and never the commands, we have a wealth of capability that supports evolutionary change" - Neil Avery
https://www.confluent.io/blog/journey-to-event-driven-part-1-why-event-first-thinking-changes-everything


Error flows

.
                    _ ._  _ , _ ._
                  (_ ' ( `  )_  .__)
                ( (  (    )   `)  ) _)
              (__ (_   (_ . _) _) ,__)
                  `~~`\ ' . /`~~`
                        ;   ;
                        /   \
          _____________/_ __ \_____________         .
  • Fail fast
  • Fail silently
  • Dead letters
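
The three strategies above can be sketched as wrappers around a remote call. Everything here is hypothetical: the `remoteCall` function stands in for a client such as a sentiment or Dialogflow request, and the in-memory list stands in for a dead letter topic that a real UDF would produce to.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical wrapper showing three ways a UDF can react when a remote call fails.
class ErrorFlows {

    // Stand-in for a dead letter topic; a real UDF would produce to Kafka instead.
    final List<String> deadLetters = new ArrayList<>();

    // Fail fast: let the exception propagate to the caller.
    String failFast(final Function<String, String> remoteCall, final String input) {
        return remoteCall.apply(input);
    }

    // Fail silently: swallow the error and return null for this row.
    String failSilently(final Function<String, String> remoteCall, final String input) {
        try {
            return remoteCall.apply(input);
        } catch (RuntimeException e) {
            return null;
        }
    }

    // Dead letters: keep the failed input so it can be inspected or replayed later.
    String deadLetter(final Function<String, String> remoteCall, final String input) {
        try {
            return remoteCall.apply(input);
        } catch (RuntimeException e) {
            deadLetters.add(input);
            return null;
        }
    }
}
```

Dead letters pair naturally with the event-first idea above: inputs that failed are retained and can be reprocessed once the function or model improves.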

Agenda


  •   Motivation
  •   Terminology / Basics
  •   Remote services / models
  • Embedded models
  • Polyglot UDF experiment
  • Summary

Example IV

Spam detection


Concepts

  • Embedded models

Enron email dataset

  • Enron hid billions of dollars in debt from investors through accounting fraud
  • Its emails were made public by the Federal Energy Regulatory Commission
  • Let's build a spam detector

H2O

  • Training models is easy
  • Models can be exported to Java classes

Let's see how easy it is to build & export a model with H2O

Embed the model
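
A sketch of what embedding could look like, under heavy assumptions. `ExportedSpamModel` is a hand-written stand-in for a model class exported by a training tool; it is not generated code and its scoring rule is invented for illustration. The point is only the structure: the model lives on the UDF's classpath, is instantiated once, and is scored in-process with no network call.

```java
import java.util.Locale;

// Stand-in (assumption) for an exported model class. NOT generated code;
// the keyword-based scoring rule is made up for illustration.
class ExportedSpamModel {
    double score(final String text) {
        final String lower = text.toLowerCase(Locale.ROOT);
        double score = 0.0;
        if (lower.contains("free")) {
            score += 0.5;
        }
        if (lower.contains("winner")) {
            score += 0.5;
        }
        return score;
    }
}

// Hypothetical UDF that embeds the model and scores each row in-process.
class SpamUdf {
    // Load the model once per JVM rather than once per row.
    private static final ExportedSpamModel MODEL = new ExportedSpamModel();
    private static final double THRESHOLD = 0.5;

    public String classify(final String emailBody) {
        if (emailBody == null) {
            return null;
        }
        return MODEL.score(emailBody) >= THRESHOLD ? "spam" : "ham";
    }
}
```

Swapping in a newer model then means rebuilding and redeploying the jar, which is the trade-off against a remote service that manages models for you.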

Remote vs Embedded

Remote


  • −   Higher latency
  • −   Less predictable failures
  • −   No offline support
  • +   Simple integration
  • +   Built-in model management

Check out Kai's anomaly detection UDF for another H2O example

https://github.com/kaiwaehner/ksql-udf-deep-learning-mqtt-iot

Agenda


  •   Motivation
  •   Terminology / Basics
  •   Remote services / models
  •   Embedded models
  • Polyglot UDF experiment
  • Summary

Example V

Ruby UDF


Concepts

  • Multilingual UDFs
  • Polyglot programming
  • Democratize UDF development for non-Java developers
  • This is experimental

Installing guest languages

Graal updater (gu)

$ gu install ruby

$ gu available

ComponentId              Version             Component name
----------------------------------------------------------------
python                   1.0.0-rc15          Graal.Python
R                        1.0.0-rc15          FastR
ruby                     1.0.0-rc15          TruffleRuby

Now, let's create a Polyglot UDF!

Gotchas


  • Benchmarks are needed. Initial tests show a startup penalty for some languages
  • Using libs in guest languages may not always work
  • Encountered silent and hard-to-debug failures

Is full integration into KSQL possible?

I built a proof of concept (POC)

https://github.com/magicalpipelines/docker-ksql-multilingual-udfs-poc

POC


  • Multilingual UDFs in interactive mode
  • Experimental KSQL language extensions

Inline Python UDF (POC only)

The POC shows that polyglot UDFs are possible...

But inline Java UDFs may come first

https://github.com/confluentinc/ksql/pull/2605

Agenda


  •   Motivation
  •   Terminology / Basics
  •   Remote services / models
  •   Embedded models
  •   Polyglot UDF experiment
  • Summary

Recap

What did we learn through these examples?


  • Bootstrapping new projects
  • Building
  • Deploying
  • Configuring
  • Error handling

Vision

We have the ingredients for a rich ecosystem

There should be a community for sharing KSQL functions


magicalpipelines.com/luna

Simply submit some info about your function at

github.com/magicalpipelines/luna

Then others can discover your function

Now what?

Go build something exciting

Links

Questions?