Category Archives: Big Data

Karate SQL? KSQL vs. Kafka Streams

Just kidding -:) No Karate SQL I am aware of..

Naturally you would use Kafka Streams if your code runs on Java where your code requires SQL like access to the data.

“Kafka Streams is the core API for stream processing on the JVM: Java, Scala, Clojure etc. It is based on a DSL (Domain Specific Language) that provides a declaratively-styled interface where streams can be joined, filtered, grouped or aggregated using the DSL itself. It also provides functionally-styled mechanisms — map, flatMap, transform, peek, etc”

(Building a Microservices Ecosystem with Kafka Streams and KSQL
via Instapaper)

Checkout KSQL, the Kafka Streams client for cases where you want to run SQL queries vs Kafka outside a JVM.

You can set a KSQL container as a side car along with your app container and let the app act upon regular Kafka Topic events, discarding the need for the app to deal with the data query logic needed to find relevant data off the stream.

Example: Your micro service needs to act upon a new customer order. Your sidecar container will run KSQL DSL select and stream only relevant event data to your app one at a time (configurable).

KSQL will get a copy of the same data across your micro services replicas.

Sounds like fun? Well because it is!

Maybe it should be called Karate SQL after all..


If you use AWS, and need Kafka (otherwise you would use AWS Kinesis), here is a nice basic starter automation for setting Kafka on AWS.

AWS Athena says No so beautifully

Amazon AWS Athena allows you run ANSI SQL directly against your S3 Buckets supporting a multitude of file formats and data formats

Here are my insights taken from a comprehensive YouTube session lead by Abhishek Sinha

  • No ETL needed
  • No Servers or instances
  • No warmup required
  • No data load before querying
  • No need for DRP – it’s multi AZ

Uses Presto (in memory data distributed data query engine) and HIVE (DDL table creation to reference to your S3 data)
You pay for the amount of data scanned, so you can optimize the performance as well as cost, if you:

  1. Compress your data
  2. Store it in a columned format
  3. Partition it
  4. Convert it to Parquet / ORC format

Querying in Athena:

  1. You can query Athena via the AWS Console (dozens of queries can run in parallel) or using any JDBC enabled tool such as SQL Workbench
  2. You can stream Athena queries results into S3 or AWS Quick Sight (Spice)
  3. Creating a table for query in Athena is merely writing a schema that you later refer to
  4. Table Schema you create for queries are fully managed and Highly Available
  5. Queries will act as the route to the data so every time you execute the Query it re-evaluates everything in the relevant buckets
  6. To create a partition you specify a key value and then a bucket and a prefix that points to the data that correlates with this partition

Just note that Athena serves specific use cases (such as non urgent ad-hoc queries) where other Big Data tools are used to fulfill other needs – AWS Redshift is more aimed at quickest query times for large amounts of unstructured data, where AWS Kinesis Analytics is aimed at queries of rapidly streaming data.

Want to learn more on Big Data and AWS? Visit

AWS Aurora vs. AWS RDS MySQL,12 Criteria Power Check-List

Firstly take a look at this recent AWS April 2016 Webinar Series – Migrating your Databases to Aurora and the  AWS June 2016 Webinar Series – Amazon Aurora Deep Dive – Optimizing Database Performance session lead by Puneet Agarwal

Here are the main reasons I would look into for choosing AWS Aurora over AWS RDS MySQL

  1. Faster recovery from instance failure (X5 times or more vs. MySQL)
  2. Consistent lower impact on the Primary replica
  3. Need additional throughput (theoretically X5 times for the same resources vs. MySQL). This was achieved by decoupling the cache and storage sub-systems and spreading them across many nodes as well as performing log commit first while DB manipulation is done asynchronously.
  4. Using or can migrate to MySQL 5.6
  5. Comfortable with the Aurora I/O mechanism (16K for read, 4K for write, all can be batched if smaller)
  6. Get more replicas (maximum of 15 vs. 5 in MySQL)
  7. Prioritise recovery replica targets and set replicas with different size than the master
  8. Need virtually no replication lag – since replica nodes share the same storage as the master uses
  9. Able to decide about encryption at rest, at DB creation
  10. Accept working with the InnoDB engine alone
  11. Want to eliminate the need for cache warming
  12. Allow additional 20% pricing over MySQL to gain all the above 🙂