A new Amazon Redshift feature allows automatic shipping of audit logs to S3, enabling historical analysis beyond the few days of logs kept inside Redshift
Amazon Athena lets you run ANSI SQL directly against your S3 buckets, supporting a multitude of file and data formats
- No ETL needed
- No Servers or instances
- No warmup required
- No data load before querying
- No need for DRP – it’s multi AZ
Athena uses Presto (an in-memory distributed query engine) and Hive (DDL for creating tables that reference your S3 data)
You pay for the amount of data scanned, so you can optimize both performance and cost if you:
- Compress your data
- Store it in a columnar format
- Partition it
- Convert it to Parquet / ORC format
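Since Athena tables are defined with Hive DDL, the optimizations above mostly come down to how you lay out the data and declare the table. As a sketch, the helper below builds a `CREATE EXTERNAL TABLE` statement over partitioned Parquet data; the table name, columns, and S3 bucket are all hypothetical placeholders, not an official template.

```python
def parquet_table_ddl(table, columns, location, partition_key="dt"):
    """Build a Hive DDL statement for an Athena table over Parquet data.

    table, columns, location and partition_key are placeholders --
    substitute your own names and S3 bucket.
    """
    cols = ",\n  ".join(f"{name} {ctype}" for name, ctype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({partition_key} string)\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}';"
    )

ddl = parquet_table_ddl(
    "access_logs",                      # hypothetical table name
    [("request_time", "timestamp"),
     ("status", "int"),
     ("uri", "string")],
    "s3://my-bucket/logs/",             # hypothetical bucket
)
print(ddl)
```

Because Athena charges per byte scanned, declaring the data as Parquet and partitioning it as above means a query with a partition filter reads only the matching prefix instead of the whole bucket.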
Querying in Athena:
- You can query Athena via the AWS Console (dozens of queries can run in parallel) or with any JDBC-enabled tool such as SQL Workbench
- You can stream Athena query results into S3 or Amazon QuickSight (SPICE)
- Creating a table in Athena is merely writing a schema that you later refer to
- The table schemas you create are fully managed and highly available
- Queries act as the route to the data, so every execution re-scans everything in the relevant buckets
- To create a partition, you specify a key value plus a bucket and prefix that point to the data correlated with that partition
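To illustrate that last point (a key value mapped to a bucket and prefix), here is a hedged sketch that builds the corresponding Hive `ALTER TABLE ... ADD PARTITION` statement; the table name, partition key, and S3 path are made up for the example.

```python
def add_partition_ddl(table, key, value, s3_prefix):
    """Build an ALTER TABLE statement registering one Athena partition.

    Maps a partition key/value pair to the S3 prefix holding the
    matching data files (all names here are hypothetical).
    """
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({key} = '{value}') "
        f"LOCATION '{s3_prefix}';"
    )

stmt = add_partition_ddl("access_logs", "dt", "2017-01-15",
                         "s3://my-bucket/logs/dt=2017-01-15/")
print(stmt)
```

Once the partition is registered, queries filtering on `dt = '2017-01-15'` scan only that prefix, which is exactly the cost lever mentioned earlier.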
Just note that Athena serves specific use cases (such as non-urgent ad-hoc queries), while other Big Data tools fulfill other needs: Amazon Redshift is aimed at the quickest query times over large amounts of structured data, while Amazon Kinesis Analytics is aimed at querying rapidly streaming data.
Want to learn more on Big Data and AWS? Visit http://allcloud.io
I have seen a peered Mongo! https://www.mongodb.com/blog/post/introducing-vpc-peering-for-mongodb-atlas
First, take a look at the recent AWS April 2016 Webinar Series session – Migrating your Databases to Aurora – and the AWS June 2016 Webinar Series session – Amazon Aurora Deep Dive – Optimizing Database Performance, led by Puneet Agarwal. Aurora is likely a good fit if you:
- Want faster recovery from instance failure (5x or more vs. MySQL)
- Want consistently lower replication impact on the primary
- Need additional throughput (theoretically 5x for the same resources vs. MySQL). This was achieved by decoupling the cache and storage subsystems and spreading them across many nodes, as well as committing the log first while DB manipulation is done asynchronously.
- Are using, or can migrate to, MySQL 5.6
- Are comfortable with the Aurora I/O mechanism (16K reads, 4K writes; smaller operations can be batched)
- Want more replicas (a maximum of 15 vs. 5 in MySQL)
- Want to prioritize replica failover targets and run replicas sized differently from the master
- Need virtually no replication lag, since replica nodes share the same storage the master uses
- Can decide on encryption at rest at DB creation time
- Accept working with the InnoDB engine alone
- Want to eliminate the need for cache warming
- Accept roughly 20% higher pricing than MySQL to gain all the above 🙂
Using a decentralized (masterless) Puppet stack has its benefits for dynamic, fast-morphing environments.
Yet you’d still love to get all changes made to your environment recorded in a central repo.
Facter can be easily customized to ship new types of configuration information as your heart desires.
What are you using?
I got my first hands-on experience with Apache Spark about a year ago, and it seemed cool. Yet going through my updated quick notes here, I felt myself falling in love with it 😎 It has grown much more in integration options as well as features.
- The Zeppelin notebook checks for syntax errors, shows you the resulting data, and lets you submit jobs to a Spark cluster
- Scala is the default language, but Spark can also be used from Python, SQL, R and others
- Spark is newer than Hadoop MapReduce and positioned to replace it
- Spark minimizes data shuffling by working in memory and reduces data movement across cluster nodes using partitioning
- Runs on top of the JVM
- Scala favors functional programming: instead of writing a for loop over Y with an if that appends to X, you declare X as the collection Y filtered by a predicate
- Spark uses RDDs (Resilient Distributed Datasets): fault-tolerant collections of elements that can be operated on in parallel to produce the data processing we want
- Spark supports many data sources: Hive, JSON, Cassandra, Elasticsearch
- Spark can be used with MLlib for machine learning
- Spark Streaming allows DataFrame manipulation on the fly, letting you write streaming jobs the same way you write batch jobs. It supports Java, Scala and Python.
- SparkR lets you interact with Spark via R; it is still not fully featured
- You can submit Spark jobs via: an EMR step, Lambda, AWS Data Pipeline, Airflow, Zeppelin, RStudio
- You can reduce cost and keep data off the cluster by storing it on S3 and using EMRFS
- In AWS you can hook Spark up to DynamoDB, RDS, Kinesis and many others
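The functional-vs-imperative point above can be sketched in plain Python, without a Spark cluster; with PySpark you would express the same thing as a `.filter` transformation on an RDD or DataFrame. The numbers and predicate here are arbitrary.

```python
ys = [3, 8, 15, 22, 41, 60]

# Imperative style: loop over Y, and if the predicate holds, append to X.
xs_loop = []
for y in ys:
    if y % 2 == 0:
        xs_loop.append(y)

# Functional style: X is "the collection Y filtered by a predicate" --
# the same shape as an RDD transformation like rdd.filter(lambda y: ...).
xs_filtered = list(filter(lambda y: y % 2 == 0, ys))

print(xs_filtered)  # [8, 22, 60]
```

The functional form is what makes Spark's parallelism natural: a filter over a collection can be split across partitions, whereas an explicit loop with mutable state cannot.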