Python or Scala for Apache Spark? - My Empty Mind


Python or Scala for Apache Spark?

When I started learning Spark I was not sure which language to choose Python or Scala? I was a PL-SQL developer, Python and Scala both were a new language for me and I was not even aware of the market trend and market requirement on Apache Spark. I started asking people around me and spend considerable amount on time on google to know which language to choose for Apache Spark. 
Python or Scala for Apache Spark?
Finally I came to a conclusion which wanted to share it with all who are beginner in Spark or who are confused on the language to be chosen for Spark . 

My analysis is based on my experience and talking to people in Industry from India and US. Without further ado here is my few cents which will help you in deciding “which language to choose for Apache Spark”.

Popular language used in Industry for data analysis: -
Dominance of python in areas like data science, machine learning, deep learning is unparalleled. Python is very popular amount data scientists and because of tons of libraries its really hard to beat Python in Data analysis. 

The hottest technology TenserFlow is written in Python. Python is used in broad range of scenarios e.g. scientific and Numeric, Machine learning, software application and business application development, data mining, cross platform development, RAD (Rapid Application development). 

So Learning python will broaden the scope of your career.

Performance of PySpark :-
In Spark version 1.0.X we only had RDDs to work on, but with Spark 2.0.X we have the power of DataFrame. Using DataFrames the runtime performance of running a job in Spark using Python or Scala is same, Scala and Python DataFrames are compiled into JVM bytecodes so there is negligible performance difference. Python DataFrame operations are 5 times faster than Python RDD operations   

Spark Python vs Scala Performance

However in actual project you may sometime need to work on RDDs but that can be easily handled.

Reluctance of ETL Developers for Scala: -
With the increase in popularity of Hadoop and trust which it is building in delivering a powerful, reliable and cheaper data processing solution most of the big industry players are now thinking of implementation or re-platforming of the existing ETL projects to Hadoop. Data Analysis projects has lots of ETL jobs to process and load the data to the data marts or warehouse. Most of the ETL projects I came across had lots of Shell, Perl or other scripts for the jobs. In last few years there is a swing towards Python. Python is not only the dominant alternative to Perl and shell scripts but is also  a powerful language

People in Industry think of Python a better solution over Perl and Shell script because of its presence, power, rich community and steep learning curve.   

Ease of Learning and Productivity Graph: -
Python is both functional and object oriented which makes it both easier and robust. For a person of PLSQL background Python will certainly be his choice. Python is easy to learn and has a steady learning curve as compared to other programming languages which have a very steep learning curve.
Ease of development is also there because of presence of wide python community.

Its really Easy!!!!
The only thing you need to start Python is just start coding in Python and a browser tab to do google search.

Mastering Spark? Is this what you want? :-
Spark is written in Scala so if you know Scala it will let you understand and modify Spark internal code. Since Big Data is still evolving you will encounter many use cases where there is no direct solution available, to achieve that you will either have to choose a tedious way of achieving it or understand the spark internal and modify it if required to fit your use case.
A good example of this scenario is reading csv in spark through read.csv() and capturing all the bad records of a csv along with error message, record number and bad column value. In Spark 2.0 there is no straight way to do this. (solution to this scenario is explained in other post)

If you come across any bug in spark code you can fix only if you know Scala. e.g. DataFrameWriter.saveAsTable issue with hive format to create partitioned table

So if you want to master Spark you  will have to know Scala.

Conclusion: -
    1. If you are a beginner and you don’t have specific requirement to learn a particular language then go for Python, Python is easy, it has a steady learning curve and so will be your spark learning. You will be good spark developer in very less time with Python. Once this is done you will be in a good position to decide whether to go for Scala or you are happy with your career in PySpark.

    2. If you know python, companies working on data science (with spark), biotech software will certainly prefer you.

    3. I see a growing trend of migration of ETL projects from other languages (perl ,shell) to Python, so it will be good to choose Python at this point.

    4. It’s really easy, no extra efforts required.

To learn more on Spark please click here. Let us know your views or feedback on Facebook or Twitter @MyemptymindC.

No comments:

Post a Comment