State of the ‘Spark’

I first got hands-on with Apache Spark about a year ago and it seemed cool. Going back through my updated quick notes here, though, I found myself falling in love with it 😎 It has grown a lot, in integration options as well as features.

  1. The Zeppelin notebook checks for syntax errors, shows you the data, and lets you submit jobs to a Spark cluster
  2. Scala is the default language, but Spark can also be driven from Python, SQL, and others
  3. Spark is newer than Hadoop MapReduce and is positioned to replace it as the processing engine
  4. Spark optimizes data movement by keeping working data in memory and reduces shuffling across cluster nodes by controlling partitions (see the reduceByKey sketch after this list)
  5. Runs on top of the JVM
  6. Scala is built around functional programming: instead of a for loop over Y with an if that appends to X, you write X = Y filtered by a predicate (see the filter example below)
  7. Spark uses RDDs (Resilient Distributed Datasets): fault-tolerant collections of elements that can be operated on in parallel to produce the data processing we want (see the RDD sketch below)
  8. Spark supports many data sources: Hive, JSON, Cassandra, Elasticsearch (see the JSON example below)
  9. Spark ships with MLlib for machine learning (see the MLlib sketch below)
  10. Spark Streaming allows DataFrame manipulations on the fly, letting you write streaming jobs the same way you write batch jobs. It supports Java, Scala, and Python (see the streaming example below)
  11. SparkR lets you interact with Spark via R. It is still not fully featured
  12. You can submit Spark jobs from any of these: an EMR step, Lambda, AWS Data Pipeline, Airflow, Zeppelin, RStudio
  13. You can reduce cost by keeping data off the cluster on S3, reading it through EMRFS (see the S3 example below)
  14. In AWS you can hook Spark up to DynamoDB, RDS, Kinesis, and many others
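
A minimal sketch of the partitioning point from item 4, assuming a local Spark setup (the app name and sample data are mine): reduceByKey pre-aggregates values inside each partition before anything crosses the network, which is exactly the kind of shuffle reduction I mean above.

```scala
import org.apache.spark.sql.SparkSession

object ShuffleSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ShuffleSketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // A pair RDD of (word, 1) spread over 4 partitions
    val pairs = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), 4).map(w => (w, 1))

    // reduceByKey combines values within each partition first,
    // so far less data is shuffled across nodes than with groupByKey
    val counts = pairs.reduceByKey(_ + _)
    counts.collect().foreach(println)

    spark.stop()
  }
}
```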
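
The functional style from item 6, as plain Scala (no Spark needed, and the collection contents are made up):

```scala
val y = List(1, 2, 3, 4, 5, 6)

// Imperative style: loop, test, mutate an accumulator
var x = List.empty[Int]
for (n <- y) {
  if (n % 2 == 0) x = x :+ n
}

// Functional style: describe the result, not the steps
val xFunctional = y.filter(_ % 2 == 0)  // List(2, 4, 6)
```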
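
A quick RDD sketch for item 7, in the shape you would paste into spark-shell or a Zeppelin paragraph: parallelize distributes a local collection, the transformations run per partition, and a lost partition is recomputed from lineage rather than restored from a replica.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("RddSketch").master("local[*]").getOrCreate()

// Distribute a local collection across the cluster as an RDD
val numbers = spark.sparkContext.parallelize(1 to 1000)

// map and reduce run in parallel on each partition; that's the
// "operated on in parallel" part, with lineage giving fault tolerance
val sumOfSquares = numbers.map(n => n.toLong * n).reduce(_ + _)
println(sumOfSquares)
```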
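
For the data sources in item 8, JSON is the easiest to show; people.json is a hypothetical newline-delimited JSON file, and sources like Cassandra or Elasticsearch plug into the same read API once their connector jars are on the classpath.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("JsonSketch").master("local[*]").getOrCreate()

// Read newline-delimited JSON into a DataFrame; the schema is inferred
val people = spark.read.json("people.json")  // hypothetical file
people.printSchema()
people.show()
```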
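
A tiny MLlib sketch for item 9, clustering a made-up four-point dataset with KMeans. MLlib's DataFrame API expects the features packed into a single vector column, hence the VectorAssembler step.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("MLlibSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Toy data: two obvious clusters
val df = Seq((0.0, 0.1), (0.2, 0.0), (9.0, 9.1), (9.2, 8.9)).toDF("x", "y")

// Pack the feature columns into the single vector column MLlib expects
val features = new VectorAssembler()
  .setInputCols(Array("x", "y"))
  .setOutputCol("features")
  .transform(df)

val model = new KMeans().setK(2).setSeed(1L).fit(features)
model.clusterCenters.foreach(println)
```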
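
Item 10 is easiest to see in code. This is the classic Structured Streaming word count over a local socket (feed it with nc -lk 9999); note that the groupBy/count is written exactly like a batch query.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("StreamingSketch").master("local[*]").getOrCreate()
import spark.implicits._

// An unbounded DataFrame backed by a socket
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Same API as a batch job: split, group, count
val counts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

// Print the running counts as new lines arrive
val query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```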
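
And the S3 point from item 13: on EMR, s3:// paths resolve through EMRFS, so the data lives off the cluster and the cluster itself stays disposable. The bucket, prefix, and column name here are all hypothetical.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("S3Sketch").getOrCreate()

// On EMR, s3:// goes through EMRFS, so data stays off-cluster
val events = spark.read.parquet("s3://my-bucket/events/")  // hypothetical bucket
events.groupBy("event_type").count().show()                // hypothetical column

// Results can go straight back to S3 too:
// events.write.parquet("s3://my-bucket/output/")
```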
