AWS Athena says No so beautifully

AWS Athena lets you run ANSI SQL directly against the data in your S3 buckets, and it supports a multitude of file and data formats.

Here are my insights, taken from a comprehensive YouTube session led by Abhishek Sinha:

  • No ETL needed
  • No Servers or instances
  • No warmup required
  • No data load before querying
  • No need for a DRP (disaster recovery plan) – it’s multi-AZ

Athena uses Presto (an in-memory distributed query engine) and Hive (DDL for creating tables that reference your S3 data).
You pay for the amount of data scanned, so you can optimize performance as well as cost if you:

  1. Compress your data
  2. Store it in a columnar format
  3. Partition it
  4. Convert it to Parquet / ORC format
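
Because billing follows the bytes scanned, the effect of the steps above can be estimated with a little arithmetic. The sketch below is only an illustration: the default price of 5 USD per TB, the compression ratio and the column and partition fractions are assumptions chosen for the example, not figures from the session, so check current pricing and measure your own data before relying on it.

```python
# Rough cost estimate for a query billed by bytes scanned.
# All numbers here are illustrative assumptions, not measured values.

TB = 1024 ** 4  # bytes in one tebibyte


def bytes_after_optimizations(raw_bytes, compression_ratio=1.0,
                              column_fraction=1.0, partition_fraction=1.0):
    """Estimate the bytes a query scans after the data has been optimized.

    compression_ratio  -- e.g. 4.0 means the data shrinks to a quarter
    column_fraction    -- share of columns the query reads (columnar formats)
    partition_fraction -- share of partitions the query touches
    """
    return raw_bytes / compression_ratio * column_fraction * partition_fraction


def estimate_query_cost(bytes_scanned, price_per_tb_usd=5.0):
    """Return the estimated cost in USD; the default price is an assumption."""
    return bytes_scanned / TB * price_per_tb_usd


if __name__ == "__main__":
    raw = 1 * TB
    optimized = bytes_after_optimizations(
        raw, compression_ratio=4.0, column_fraction=0.1, partition_fraction=0.1
    )
    print(f"unoptimized: {estimate_query_cost(raw):.4f} USD")
    print(f"optimized:   {estimate_query_cost(optimized):.4f} USD")
```

The three factors multiply, which is why combining compression, a columnar format and partitioning pays off far more than any one of them alone.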

Querying in Athena:

  1. You can query Athena via the AWS Console (dozens of queries can run in parallel) or with any JDBC-enabled tool such as SQL Workbench
  2. You can stream Athena query results into S3 or Amazon QuickSight (SPICE)
  3. Creating a table to query in Athena merely means writing a schema that you later refer to
  4. The table schemas you create for queries are fully managed and highly available
  5. Queries act as the route to the data, so every time you execute a query it re-evaluates everything in the relevant buckets
  6. To create a partition, you specify a key value and then the bucket and prefix that point to the data belonging to that partition
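
To make points 3 and 6 more concrete, here is a minimal sketch that builds the Hive-style DDL text for a table schema and for one partition. The helper names, the table name `access_logs`, the columns and the bucket `example-bucket` are all hypothetical; the script only produces the statements as strings, which you would then paste into the console or send through a JDBC tool.

```python
# Builds Hive-style DDL strings for an external table and one partition.
# Table, column and bucket names are hypothetical placeholders.
# Values are interpolated naively (no escaping), so use trusted input only.


def create_table_ddl(table, columns, partition_keys, location,
                     stored_as="PARQUET"):
    """Return a CREATE EXTERNAL TABLE statement: a schema over S3 data."""
    cols = ",\n  ".join(f"{name} {typ}" for name, typ in columns)
    parts = ", ".join(f"{name} {typ}" for name, typ in partition_keys)
    lines = [f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (", f"  {cols}", ")"]
    if parts:
        lines.append(f"PARTITIONED BY ({parts})")
    lines.append(f"STORED AS {stored_as}")
    lines.append(f"LOCATION '{location}'")
    return "\n".join(lines)


def add_partition_ddl(table, partition_values, location):
    """Return an ALTER TABLE statement mapping key values to an S3 prefix."""
    spec = ", ".join(f"{key}='{value}'" for key, value in partition_values.items())
    return (f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION ({spec}) LOCATION '{location}'")


if __name__ == "__main__":
    print(create_table_ddl(
        "access_logs",
        columns=[("request_path", "string"), ("status", "int")],
        partition_keys=[("dt", "string")],
        location="s3://example-bucket/logs/",
    ))
    print(add_partition_ddl(
        "access_logs",
        {"dt": "2017-01-15"},
        "s3://example-bucket/logs/dt=2017-01-15/",
    ))
```

Nothing is loaded or copied by either statement: the table is only a schema over the prefix, and the partition is only a mapping from a key value to the prefix that holds its data.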

Just note that Athena serves specific use cases (such as non-urgent ad hoc queries), while other Big Data tools fulfill other needs: AWS Redshift is aimed at the fastest query times on large amounts of structured data, whereas AWS Kinesis Analytics is aimed at querying rapidly streaming data.

Want to learn more on Big Data and AWS? Visit http://allcloud.io