K  S  Q  L

Hi, I'm Mitch

  • Engineer @Mailchimp
  • Speaker @ Kafka Summit London
  • magicalpipelines.com
  • new dad   ❤   thai food, outrun, retrowave
Mailchimp

Agenda


  • KSQL Overview
  • Architecture
  • Dev Workflows
  • Basics / Tutorial
  • Functions
  • In Production

A very quick

Kafka Refresher

  • Topics
  • Partitions
  • Consumers
  • Producers

Topics

&

Partitions

Data is stored in topics

Topics are split into partitions

#!

                          Topic
              
              _____________________________________ 
Partition 1   |  |  |  |  |  |  |  |  |  |  |  |
              _____________________________________ 

              _____________________________________ 
Partition 2   |  |  |  |  |  |  |  |  |  |  |  |
              _____________________________________

Producers

&

Consumers

Producer
  |                                                     _______ Consumer 1
  |                                                    /
  |      ________________________________________  __ /
   --->      |  |  |  |  |  |  |  |  |  |  |  |  
         ________________________________________   <---------- Consumer 2
         
                                                    \
                                                     \
                                                      \ _______ Consumer 3

Consumer Groups

Anatomy of a message

// simplified

Timestamp  => int64
Key        => bytes
Value      => bytes

What is KSQL?

    Streaming SQL Engine


  • Continuous queries against Kafka topics
  • Familiar, SQL syntax
  • Easy to deploy
  • Highly scalable

Kafka usage @

Mailchimp

examples of

KSQL use cases

Basic

  • Reshaping data
  • Reserializing data from one format to another
  • Topic inspection, sampling, and troubleshooting

Legal

  • Data masking / filtering
  • Branching streams based on message contents or data usage permissions

Other

  • Data enrichment with joins
  • Suspicious user activies (e.g. high API usage)
  • Anomaly detection in other systems
  • Sentiment analysis of social data using custom UDFs

Agenda


  •   KSQL Overview
  • Architecture
  • Dev Workflows
  • Basics / Tutorial
  • Functions
  • In Production

Architecture

Modes

  • Interactive
  • Headless

Agenda


  •   KSQL Overview
  •   Architecture
  • Dev Workflows
  • Basics / Tutorial
  • Functions
  • In Production

Dev workflows

Basics

  • running KSQL
  • CLI
  • ksql-datagen
  • RUN SCRIPT

Running KSQL

Option 1

Download Confluent platform

https://docs.confluent.io/current/ksql/docs/installation/installing.html

Option 2

Magical Pipelines Helm Chart

https://github.com/magicalpipelines/ksql-helm-chart

CLI basics

Viewing configs

ksql> show properties ;

 Property                                               | Value
------------------------------------------------------------------------------------------
 ksql.extension.dir                                     | ext
 ksql.output.topic.name.prefix                          |
 ksql.persistent.prefix                                 | query_
 ksql.schema.registry.url                               | http://localhost:8081
 ksql.service.id                                        | dev
 ksql.sink.partitions                                   | 4
 ksql.sink.replicas                                     | 1
 ksql.sink.window.change.log.additional.retention       | 1000000
 ksql.statestore.suffix                                 | _ksql_statestore
 ksql.streams.application.id                            | KSQL_REST_SERVER_DEFAULT_APP_ID
 ksql.streams.auto.offset.reset                         | earliest
 ksql.streams.bootstrap.servers                         | localhost:9092
 ksql.streams.cache.max.bytes.buffering                 | 10000000
 ksql.streams.commit.interval.ms                        | 2000
 ...
------------------------------------------------------------------------------------------

Setting configs

ksql> SET 'ksql.udfs.enabled' = 'false';
ksql> SET 'auto.offset.reset' = 'latest';

Viewing topics

ksql> SHOW TOPICS ;

 Kafka Topic       | Registered | Partitions | Partition Replicas | Consumers | ConsumerGroups
-----------------------------------------------------------------------------------------------
 _schemas          | false      | 1          | 1                  | 0         | 0
 console_purchases | true       | 4          | 1                  | 0         | 0
 game_purchases    | true       | 4          | 1                  | 0         | 0
 users             | false      | 4          | 1                  | 0         | 0
-----------------------------------------------------------------------------------------------

ksql> PRINT 'game_purchases' LIMIT 1 ;
Format:AVRO
...

ksql-datagen

  • Generating data in dev topics
  • Multiple formats (Avro, JSON, delimited)

ksql-datagen

# game_purchases
$ ksql-datagen \
    schema=game_purchase.avro \
    format=avro \
    topic=game_purchases \
    key=user_id \
    maxInterval=5000 \
    iterations=10000 &>/dev/null &

# game_consoles
$ ksql-datagen \
    schema=console_purchase.avro \
    format=avro \
    topic=console_purchases \
    key=user_id \
    maxInterval=5000 \
    iterations=10000 &>/dev/null &

The console producer is pretty handy for

pre-populating tables

$ kafka-console-producer \
    --broker-list localhost:9092 \
    --property 'parse.key=true' \
    --property 'key.separator=:' \
    --topic users < /etc/ksql/project/users.json

RUN SCRIPT

  • Run adhoc queries in CLI
  • Once you're ready, move to a SQL file
  • Use
    RUN SCRIPT
    to execute the SQL file
    in subsequent sessions

RUN SCRIPT

ksql> RUN SCRIPT 'store.sql' ;

Agenda


  •   KSQL Overview
  •   Architecture
  •   Dev Workflows
  • Basics / Tutorial
  • Functions
  • In Production

Retro Game Store

Goals

  • Combine purchases streams
  • Enrich purchase data with user info
  • Detect suspicious purchasing behavior

Streams vs Tables

  • A stream in Kafka records the full history of world (or business) events from the beginning of time to today.
    • A table in Kafka is the state of the world today

    Michael Knoll

    Stream

    .....................
    
    campaign_1  => created
    campaign_2  => created
    campaign_1  => sent
    campaign_2  => updated
    campaign_2  => sent

    Table

    .....................
    
    campaign_1  => sent
    campaign_2  => sent

    Streams

    CREATE STREAM game_purchases
    WITH (
      kafka_topic='game_purchases',
      value_format='avro'
    );
    
    CREATE STREAM console_purchases
    WITH (
      kafka_topic='console_purchases',
      value_format='avro'
    );

    Inspect

    ksql> SHOW STREAMS ;
    
     Stream Name       | Kafka Topic       | Format
    ------------------------------------------------
     CONSOLE_PURCHASES | console_purchases | AVRO
     GAME_PURCHASES    | game_purchases    | AVRO
    ------------------------------------------------
    
    ksql> DESCRIBE game_purchases ;
    
    Name                 : GAME_PURCHASES
     Field         | Type
    -------------------------------------------
     ROWTIME       | BIGINT           (system)
     ROWKEY        | VARCHAR(STRING)  (system)
     USER_ID       | VARCHAR(STRING)
     PURCHASE_ID   | VARCHAR(STRING)
     PURCHASE_TIME | BIGINT
     PRODUCT       | VARCHAR(STRING)
     PRODUCT_TYPE  | VARCHAR(STRING)
     CREDIT_CARD   | VARCHAR(STRING)
    -------------------------------------------
    For runtime statistics and query details run: DESCRIBE EXTENDED <Stream,Table>;

    Select

    SELECT user_id, purchase_id, product, product_type
    FROM game_purchases;
    
    
    # output
    user_1000 | purchase_83 | Starfox 64 | game
    user_1000 | purchase_37 | Goldeneye 007 | game
    user_1000 | purchase_11 | Wave Race | game
    
    # query continues to run.
    user_4000 | purchase_93 | Starfox 64 | game

    Continuous

    Where does the query output go?

    It depends...

    Queries can be either:

    • Persistent
    • Non-persistent

    Persistent

    • Results are written back to Kafka
    • Connect, Streams, ... can read the output and do something with it

    Non-persistent

    • Still continuously streaming queries
    • Results aren't written to an output topic

    Non-persistent

    ksql> SELECT *
          FROM game_purchases ;

    Persistent

    ksql> CREATE STREAM game_names AS
          SELECT product as game_name
          FROM game_purchases ;
    
    
    ksql> SHOW TOPICS ;
    
     Kafka Topic       | Registered | Partitions | Partition Replicas | Consumers | ConsumerGroups
    -----------------------------------------------------------------------------------------------
     _schemas          | false      | 1          | 1                  | 0         | 0
     console_purchases | true       | 4          | 1                  | 0         | 0
     GAME_NAMES        | true       | 4          | 1                  | 0         | 0
     game_purchases    | true       | 4          | 1                  | 8         | 2
     users             | false      | 4          | 1                  | 0         | 0
    -----------------------------------------------------------------------------------------------

    Merging Streams

    CREATE STREAM all_purchases
    WITH (
      kafka_topic='purchases',
      value_format='avro'
    ) AS
    SELECT * FROM game_purchases ;
    
    INSERT INTO all_purchases SELECT * FROM console_purchases;

    Filtering

    ksql> SELECT product
          FROM console_purchases
          WHERE user_id = 'user_2000'
          LIMIT 2;
    
    
    Nintendo 64
    Sega Game Gear

    Tables

    CREATE TABLE users(
      user_id VARCHAR,
      first_name VARCHAR,
      last_name VARCHAR,
      email VARCHAR,
      address STRUCT<street VARCHAR, city VARCHAR, state VARCHAR, zip INT>)
    WITH (
      kafka_topic='users',
      value_format='json',
      key='user_id'
    );

    Tables

    SHOW TABLES ;
    
     Table Name | Kafka Topic | Format | Windowed
    ----------------------------------------------
     USERS      | users       | JSON   | false
    ----------------------------------------------

    Joins

    • Can be used for combining / enriching data
    • [
      LEFT
      |
      FULL
      |
      INNER
      ]
      JOIN

    Join example

    SELECT 
      u.user_id, u.first_name,
      u.last_name,
      p.purchase_id,
      p.product,
      p.product_type,
      p.credit_card
    FROM all_purchases p
    LEFT JOIN users u
    ON p.user_id = u.user_id
    LIMIT 3;
    
    user_1000 | Arnold | Jones | purchase_83 | Starfox 64 | game | 2237-7528-7672-6519
    user_1000 | Arnold | Jones | purchase_37 | Goldeneye 007 | game | 3349-7292-7388-6154
    user_1000 | Arnold | Jones | purchase_11 | Wave Race | game | 5326-3985-1774-2293

    Aggregations

    ksql> SELECT user_id, COUNT(*) as total_purchases
          FROM console_purchases
          GROUP BY user_id;
    
    
    # output
    user_1000 | 1
    user_4000 | 5
    user_3000 | 2
    user_5000 | 2
    user_2000 | 1

    Built-in UDFs

    CREATE STREAM purchases_formatted
    WITH (
      kafka_topic='purchases',
      value_format='avro'
    ) AS
    SELECT
      u.user_id as user_id,
      u.first_name || ' ' || u.last_name as full_name,
      p.purchase_id as purchase_id,
      p.product as product,
      p.product_type as product_type,
      MASK_LEFT(p.credit_card, 15, 'x', 'x', 'x', '-') as masked_credit_card 
    FROM all_purchases p
    LEFT JOIN users u
    ON p.user_id = u.user_id;

    Windowing

    ksql> CREATE TABLE suspicious_transactions
    WITH (
      kafka_topic='suspicious_transactions',
      value_format='avro'
    ) AS
    SELECT
      p.user_id,
      p.masked_credit_card,
      COUNT() as total_purchases
    FROM purchases_formatted p
    WINDOW TUMBLING (SIZE 10 SECONDS)
    GROUP BY
      p.user_id,
      p.masked_credit_card
    HAVING COUNT() > 3 
    LIMIT 1;

    3 types of windows

    • Hopping
    • Tumbling
    • Session

    Hopping window

    WINDOW HOPPING (SIZE 20 SECONDS, ADVANCE BY 5 SECONDS)

    Tumbling window

    WINDOW TUMBLING (SIZE 20 SECONDS)

    Session windows

    WINDOW SESSION (300 second)

    • Records with the same key are grouped into sessions
    • Sessions expire after a configurable time (e.g. 300 seconds) if no data for the key is seen
    • New records arriving after expired session will be grouped into a new session

    Time

    • Processing time
    • Event time

    Processing time

    // use the timestamp in the message
    
    Timestamp  => int64
    Key        => bytes
    Value      => bytes

    Event time processing

    ksql> CREATE STREAM api_logs (
            response_code INT,
            username VARCHAR,
            request_time LONG
          ) 
          WITH (KAFKA_TOPIC='api-logs',
                VALUE_FORMAT='JSON',
                TIMESTAMP='request_time');

    Explaining queries

    • View the topology definition

    Explain example

    ksql> show queries ;
    
     Query ID                   | Kafka Topic         | Query String
    -----------------------------------------------------------------
     CSAS_RESPONSE_CODES_ONLY_1 | RESPONSE_CODES_ONLY | ...
    ------------------------------------------------------------------
    
    
    ksql> explain CSAS_RESPONSE_CODES_ONLY_1 ;
    
    ...
    
    Processing topology
    -------------------
    Topologies:
       Sub-topology: 0
        Source: KSTREAM-SOURCE-0000000000 (topics: [api-logs])
          --> KSTREAM-MAPVALUES-0000000001
        Processor: KSTREAM-MAPVALUES-0000000001 (stores: [])
          --> KSTREAM-TRANSFORMVALUES-0000000002
          <-- KSTREAM-SOURCE-0000000000
        ...
        Sink: KSTREAM-SINK-0000000005 (topic: RESPONSE_CODES_ONLY)
          <-- KSTREAM-MAPVALUES-0000000004

    Explain Visualization

    https://zz85.github.io/kafka-streams-viz/

    Agenda


    •   KSQL Overview
    •   Architecture
    •   Dev Workflows
    •   Basics / Tutorial
    • Functions
    • In Production

    Functions

    Showing

    ksql> show functions ;
    
     Function Name           | Type
    -------------------------------------
     ABS                     | SCALAR
     ARRAYCONTAINS           | SCALAR
     CEIL                    | SCALAR
     CONCAT                  | SCALAR
     COUNT                   | AGGREGATE
     EXTRACTJSONFIELD        | SCALAR
     FETCH_FIELD_FROM_STRUCT | SCALAR
     FLOOR                   | SCALAR
     GEO_DISTANCE            | SCALAR
     IFNULL                  | SCALAR
     LCASE                   | SCALAR
     LEN                     | SCALAR
     MASK                    | SCALAR
     MASK_KEEP_LEFT          | SCALAR
     MASK_KEEP_RIGHT         | SCALAR
     MASK_LEFT               | SCALAR
     MASK_RIGHT              | SCALAR
     MAX                     | AGGREGATE
     MIN                     | AGGREGATE
     RANDOM                  | SCALAR
     ROUND                   | SCALAR
     STRINGTOTIMESTAMP       | SCALAR
     SUBSTRING               | SCALAR
     SUM                     | AGGREGATE
     TIMESTAMPTOSTRING       | SCALAR
     TOPK                    | AGGREGATE
     TOPKDISTINCT            | AGGREGATE
     TRIM                    | SCALAR
     UCASE                   | SCALAR
    -------------------------------------

    Describing

    ksql> describe function UCASE ;
    
    Name        : UCASE
    Author      : confluent
    Version     :
    Overview    :
    Type        : scalar
    Jar         : internal
    Variations  :
    
    	Arguments   : VARCHAR
    	Returns     : VARCHAR
    	Description :

    UDFs and UDAFs

    • User defined functions and User defined aggregate functions
    • Build your own custom functions for use in KSQL queries
    • Deployed as JARs inside KSQL container

    Custom functions

    @UdfDescription(name = "synthwave_name", description = "get your synthwave name")
    public class SynthwaveName {
    
      @Udf(description = "get your synthwave name")
      public String synthwaveName(String name) {
        // do something
        return "";
      }
    }
    ksql> SELECT name, synthwave_name(name)
          FROM users;
    
    # output
    anthony | synth neo
    coty | sunset laser
    erin | force grid
    garret | drive synth
    kale | night synth
    kasa | night synth
    matt | cruise synth
    mitch | cruise summer
    peony | terror force

    Agenda


    •   KSQL Overview
    •   Architecture
    •   Dev Workflows
    •   Basics / Tutorial
    •   Functions
    • In Production

    In Production

    Scaling

    • partition count
    • ksql.streams.num.stream.threads (kafka streams)
    • spec.replicas (k8s)

    Fault tolerance

    Consumer Groups

    Consumers maintain their membership by

    heart beating back to the cluster

    #!
                                                                  _
          ;#'#'.   ~'#'#;                                        |-|
         #'     `/      #'                                    _  |=|
          #.           .#                                    |-| |_|  _
          '#.         .#'  °º¤ø,¸¸,ø¤º°`°º¤ø,¸,ø¤°º¤ø,¸¸,ø¤  |=|     |-|
            '#.     .#'                                      |_|     |=| 
              '#. .#'                                                |_|
                '#'
    #!
    
    -------------------------------------------------------
     ______ 
    |.     |°º¤ø,¸¸,ø¤º°`°º¤ø,¸,ø¤°º¤ø,¸¸,ø¤ Partition 1
    | KSQL |
    | '-,  |
    |______|°º¤ø,¸¸,ø¤º°`°º¤ø,¸,ø¤°º¤ø,¸¸,ø¤ Partition 2
    
    ------------------------------------------------------|
     ______ 
    |.     |°º¤ø,¸¸,ø¤º°`°º¤ø,¸,ø¤°º¤ø,¸¸,ø¤ Partition 3
    | KSQL |
    | '-,  |
    |______|°º¤ø,¸¸,ø¤º°`°º¤ø,¸,ø¤°º¤ø,¸¸,ø¤ Partition 4
    
    -------------------------------------------------------
    #!
    
    -------------------------------------------------------
     ______ 
    |.     |°º¤ø,¸¸,ø¤º°`°º¤ø,¸,ø¤°º¤ø,¸¸,ø¤ Partition 1
    | KSQL |
    | '-,  |
    |______|°º¤ø,¸¸,ø¤º°`°º¤ø,¸,ø¤°º¤ø,¸¸,ø¤ Partition 2
    
    ------------------------------------------------------|
                        _ ._  _ , _ ._
                      (_ ' ( `  )_  .__)
                    ( (  (    )   `)  ) _)
                  (__ (_   (_ . _) _) ,__)
                      `~~`\ ' . /`~~`
                            ;   ;
                            /   \
              _____________/_ __ \_____________
    #!
    
    -------------------------------------------------------
     ______ 
    |.     |°º¤ø,¸¸,ø¤º°`°º¤ø,¸,ø¤°º¤ø,¸¸,ø¤ Partition 1
    | KSQL |°º¤ø,¸¸,ø¤º°`°º¤ø,¸,ø¤°º¤ø,¸¸,ø¤ Partition 2
    | '-,  |°º¤ø,¸¸,ø¤º°`°º¤ø,¸,ø¤°º¤ø,¸¸,ø¤ Partition 3
    |______|°º¤ø,¸¸,ø¤º°`°º¤ø,¸,ø¤°º¤ø,¸¸,ø¤ Partition 4
    
    ------------------------------------------------------|
    
                     +-------+
                     | ˥ΌSʞ  |
     ._  _ , _ .`~~  +-------+ /`~~` _ .`~~`\ '_ .`~~_  _

    Deploying

    Deploying

    Monitoring

    • JMX / Prometheus Exporter
    • Consumer lag monitoring
    • Elastalert (errors)

    Agenda


    •   KSQL Overview
    •   Architecture
    •   Dev Workflows
    •   Basics / Tutorial
    •   Functions
    •   In Production

    What now?

    # clone the repo
    $ git clone git@github.com:magicalpipelines/ksql-helm-chart.git
    
    # deploy locally
    $ ./script/build && ./script/deploy
    
    # start the CLI
    $ kubectl exec -it <POD_NAME> ksql

    Links

    Questions?