I first got hands-on with Apache Spark about a year ago and it seemed cool. Yet going through my updated quick notes here, I felt like I was falling in love with it 😎 It has grown a lot, both in integration options and in features.
- The Zeppelin notebook checks for syntax errors, shows you the data, and lets you submit jobs to a Spark cluster
- Scala is the default language, but Spark can also be used from Python, SQL and others
- Spark is newer than Hadoop MapReduce and positioned to replace it as the processing engine
- Spark optimizes data shuffling by keeping data in memory and reduces data movement across cluster nodes by partitioning the data well (see the repartition sketch after these notes)
- Runs on top of the JVM
- Scala is based on functional programming, where you would write X = collection of Y filtered by… instead of a for loop over Y with an if-then-add-to-X (see the filter sketch below)
- Spark uses RDDs – Resilient Distributed Datasets: a fault-tolerant collection of elements that can be operated on in parallel to produce the data processing we want (see the RDD sketch below)
- Spark supports many data sources: Hive, JSON, Cassandra, Elasticsearch and more (see the JSON read sketch below)
- Spark can be used with MLlib for machine learning (see the MLlib sketch below)
- Spark Streaming allows data frame manipulations on the fly – letting you write streaming jobs the same way you write batch jobs. It supports Java, Scala and Python (see the streaming sketch below)
- SparkR lets you interact with Spark via R. It is still not fully functional
- You can submit Spark jobs in several ways: an EMR step, Lambda, AWS Data Pipeline, Airflow, Zeppelin, RStudio
- You can reduce cost and keep data off the cluster by storing it on S3 and using EMRFS (see the S3 sketch below)
- In AWS you can hook Spark up with DynamoDB, RDS, Kinesis and many others
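A few quick sketches to make these notes concrete. All of them are in Scala on the Spark 2.x DataFrame API, assume a SparkSession named `spark` is already in scope (as it is in spark-shell or a Zeppelin notebook), and use made-up data, paths and column names.

First, the repartition sketch: co-locating rows by the key you are about to aggregate on keeps the later groupBy from moving unrelated rows around the cluster.

```scala
import spark.implicits._

// A small made-up DataFrame of (user_id, amount) purchase rows
val purchases = Seq(("u1", 10.0), ("u2", 5.0), ("u1", 7.5)).toDF("user_id", "amount")

// Repartition on the aggregation key so rows with the same key land on the same node
val byUser = purchases.repartition($"user_id")

// Keep the partitioned data in memory for repeated queries
byUser.cache()

byUser.groupBy($"user_id").sum("amount").show()
```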
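The filter sketch: the same "keep only the big numbers" task written imperatively and then functionally, in plain Scala with made-up numbers.

```scala
// Imperative style: loop over Y and add to X when the condition holds
val y = List(1, 5, 12, 7, 20)
val xLoop = scala.collection.mutable.ListBuffer[Int]()
for (n <- y) {
  if (n > 10) xLoop += n
}

// Functional style: X = Y filtered by a predicate
val x = y.filter(_ > 10)   // List(12, 20)
```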
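The RDD sketch: parallelize a local collection, transform it lazily, then run an action.

```scala
// Turn a local collection into a fault-tolerant, distributed RDD
val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5))

// Transformations are lazy and run in parallel across partitions
val doubled = rdd.map(_ * 2)

// Actions trigger the actual computation
val total = doubled.reduce(_ + _)   // 30
```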
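The JSON read sketch: one of the supported sources read into a DataFrame; Hive, Cassandra or Elasticsearch would mostly just swap the format and connector options (the path and columns here are invented).

```scala
import spark.implicits._

// Read a JSON dataset into a DataFrame; the schema is inferred
val events = spark.read.json("/data/events.json")

// The DataFrame API looks the same whatever the underlying source is
events.printSchema()
events.filter($"status" === "error").groupBy($"service").count().show()
```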
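The MLlib sketch: a tiny k-means clustering run on a toy dataset of 2-D points.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import spark.implicits._

// A toy dataset; MLlib estimators expect a vector column named "features"
val points = Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.2),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.2, 8.8)
).map(Tuple1.apply).toDF("features")

// Fit a k-means model with two clusters and look at the cluster assignments
val model = new KMeans().setK(2).setSeed(1L).fit(points)
model.transform(points).show()
```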
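The streaming sketch: a word count over lines arriving on a socket, written with the same DataFrame operations as a batch job (host and port are placeholders).

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

// Stream text lines from a socket source
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// The same DataFrame operations you would write in a batch job
val wordCounts = lines
  .select(explode(split($"value", " ")).as("word"))
  .groupBy("word")
  .count()

// Start the query and keep printing the running counts to the console
wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()
```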
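The S3 sketch: reading input from S3 and writing results back, so the data never has to live on the cluster. The bucket and prefixes are made up; on EMR the s3:// scheme goes through EMRFS.

```scala
import spark.implicits._

// Read input straight from S3
val logs = spark.read.parquet("s3://my-bucket/logs/2017/")

// Process on the cluster, then write the results back to S3
logs.filter($"status" === 500)
  .write
  .mode("overwrite")
  .parquet("s3://my-bucket/reports/errors/")
```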