My Empty Mind


L2: Execution modes and Resource Managers of Spark

January 20, 2018
Spark has four modes of execution based on the resource manager and coordinator used for running Spark jobs. They are:-
  1. Local Mode
  2. Standalone Mode
  3. YARN Mode
  4. Mesos

Running Spark on YARN is very common in industry.
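These execution modes are typically selected through spark-submit's --master option. The commands below are an illustrative sketch; the application file, host names and ports are placeholders, not values from this post:

```shell
# Local mode: run Spark in a single JVM with 4 worker threads
spark-submit --master local[4] my_app.py

# Standalone mode: submit to a Spark standalone cluster master
spark-submit --master spark://master-host:7077 my_app.py

# YARN mode: let YARN allocate containers for the driver and executors
spark-submit --master yarn --deploy-mode cluster my_app.py

# Mesos mode: submit to a Mesos master
spark-submit --master mesos://mesos-host:5050 my_app.py
```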


D1: PySpark - Capture bad records while loading a csv file in Spark Data Frame

January 15, 2018
Loading a csv file and capturing all the bad records is a very common task in ETL projects. The bad records are analyzed to take corrective or preventive measures for loading the file. In some cases, the client may ask you to send the bad record file for their knowledge or action, so it becomes very important to capture the bad records in these scenarios.
Most relational database loaders like SQL*Loader or nzload provide this feature, but when it comes to Hadoop and Spark (2.2.0) there is no direct solution for this.
However, a solution to this problem is present in Databricks Runtime 3.0, where you just need to provide the bad records path and the bad record file will be saved there.

df = spark.read \
    .option("badRecordsPath", "/data/badRecPath") \
    .csv(inputPath)  # inputPath: path of the csv file to load

However, in earlier Spark releases this method won't work. We can achieve the same thing in two ways:-
  1. Read the file as an RDD and then use the RDD transformation methods to filter the bad records.
  2. Use the DataFrame reader's columnNameOfCorruptRecord option together with a custom schema.

In this article we will see how we can capture bad records using the second approach. In order to load a file and capture bad records we need to perform the following steps:-

  1. Create a schema (StructType) for the feed file to load, with an extra column of string type (say bad_record) for corrupt records.
  2. Call the DataFrame reader's load() method with all the required parameters. Pass the bad record column name (the extra column created in step 1) as the parameter columnNameOfCorruptRecord.
  3. Filter the records where "bad_record" is not null and save them as a temp file.
  4. Read the temporary file as csv, passing the same schema as above (step 1).
  5. From the bad dataframe select "bad_record".

Step 5 will give you a dataframe having all the bad records.


##################### Create Schema #####################
>>> from pyspark.sql.types import StructType, StructField, IntegerType, StringType
>>> customSchema = StructType([
...     StructField("order_number", IntegerType(), True),
...     StructField("total", StringType(), True),
...     StructField("bad_record", StringType(), True)
... ])
“bad_record” is the bad record column.

>>> orders_df = spark.read \
...     .format('com.databricks.spark.csv') \
...     .option("badRecordsPath", "/test/data/bad/") \
...     .option("columnNameOfCorruptRecord", "bad_record") \
...     .options(header='false', delimiter='|') \
...     .load('/test/data/test.csv', schema=customSchema)

After calling load(), if any record doesn't satisfy the schema then null is assigned to all the columns and a concatenated value of all the columns is assigned to the bad record column.

+------------+-----+----------+
|order_number|total|bad_record|
+------------+-----+----------+
|           1| 1000|      null|
|           2| 4000|      null|
|        null| null| A|30|3000|
+------------+-----+----------+

Here, all the records where bad_record is not null are the ones that violated the schema.

The corrupt record column is populated at run time, when the DataFrame is instantiated and data is actually fetched (by calling any action).
The output of the corrupt record column depends on the other columns which are part of the RDD in that particular action call.
If the error-causing column is not part of the action call, then bad_record won't show any bad record.
If you want to overcome this issue and want the bad_record values to persist, then follow steps 3, 4 and 5, or use caching.
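The behaviour described above can also be illustrated without a Spark cluster. The pure-Python sketch below mimics what the csv reader does with columnNameOfCorruptRecord: rows that fit the schema keep their values, rows that violate it get nulls plus the raw line in the bad record column. All names here are illustrative, not Spark APIs:

```python
def parse_row(line, delimiter="|"):
    """Mimic Spark's permissive parsing for a 2-column schema
    (int order_number, string total).

    Returns (order_number, total, bad_record); bad_record is None for good rows.
    """
    parts = line.split(delimiter)
    try:
        if len(parts) != 2:
            raise ValueError("column count mismatch")
        return (int(parts[0]), parts[1], None)  # row satisfies the schema
    except ValueError:
        # Schema violated: null the data columns, keep the raw line
        return (None, None, line)

rows = [parse_row(l) for l in ["1|1000", "2|4000", "A|30|3000"]]
bad_records = [r[2] for r in rows if r[2] is not None]  # step 5 equivalent
```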


Bitcoin : A simple explanation in layman terms

January 07, 2018
Bitcoin has become a buzzword nowadays. There has been a remarkable surge in the price of bitcoin, from a few dollars in 2010 to over $19,000 each in the last couple of years. So what's so special about it?
Why Bitcoin, and what is so special about it?
  1. It is a currency which never reveals the identity of its owners. This is what makes Bitcoin so popular, and it is also the reason this currency has been banned in some countries.
  2. Bitcoin is not regulated by any government or bank, making it impossible for a government or any third party to control or manipulate it.
  3. New bitcoins can be mined by anybody.

What is Bitcoin?
Bitcoin is a decentralized currency that uses the rules of cryptography for regulation and generation. It has a fixed supply cap of 21 million coins, causing production to decrease over time and making it more valuable with time. As of now, more than half of all bitcoins have been generated.

Bitcoin is very similar in certain aspects to the e-wallets we have on our mobiles. People keep money in online wallets like Paytm, Ola Money, PayZapp, or other mobile wallets for online shopping or for buying a service; the same can be done with bitcoins as well. You can also send bitcoins to someone, just as if you were sending money to them. Though it has a lot of similarities with other online wallets, it is very different from them in many aspects, such as: -

  1. Common mobile wallets store money in terms of a currency like rupees, dollars etc., but bitcoin is itself a unit.
  2. Those currencies are government recognized, but Bitcoin is not regulated by any government or bank. It bypasses government and bank regulations.
  3. Bitcoin transactions are anonymous and secret; the identity of the people involved in a transaction is not revealed.
  4. A bitcoin wallet can be stored online or offline (e.g. on a USB drive).

Some technical jargon used in the world of bitcoin: -

  • bitcoin: - A cryptocurrency.
  • Bitcoin: - The network, software and system which regulates, manages and controls bitcoin.
  • Wallet: - A small personal database that you store on your computer drive, on your smartphone, on your tablet, or somewhere in the cloud.
  • Block: - A bunch of transactions on the network.
  • Transaction: - A transfer of money from one wallet to another.
  • Blockchain: - A ledger, a final summarized record. It is open to the public and holds the details of every transaction. A network of computers running the Bitcoin software maintains the blockchain. All bitcoin transactions are logged and made available to the public; the blockchain records every transaction and the ownership of every bitcoin on the network.
  • Miners: - The people who control the network by verifying transactions; those who mine new coins are called miners. Anyone who has a computer can mine. Bitcoin mining involves solving a complex mathematical problem. Miners ensure that transactions are secure and are processed safely.
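The "ledger" property of the blockchain described above comes from hash-linking: each block stores the hash of the previous block, so altering history breaks the chain. Here is a toy illustration in Python; the block layout is made up for the example and is not Bitcoin's real format:

```python
import hashlib
import json

def block_hash(block):
    """Hash a block's full contents, including the previous block's hash."""
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

genesis = {"prev": None, "transactions": ["coinbase->alice:50"]}
block1 = {"prev": block_hash(genesis), "transactions": ["alice->bob:10"]}

# Tampering with genesis changes its hash, so block1's "prev" no longer matches
tampered = {"prev": None, "transactions": ["coinbase->alice:500"]}
```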

How to obtain bitcoins?

You can obtain bitcoins by the following three methods: -
  1. Purchasing them through a bitcoin exchange.
  2. Accepting them as payment for services or goods you offer.
  3. Mining new coins.
     "Mining" is the term used to refer to the discovery of new bitcoins. The mining process is simply the verification of bitcoin transactions happening across the Bitcoin network.

Suppose you buy a book, a product or a service from an online store which accepts bitcoin, and you pay in bitcoin. To check the authenticity of the bitcoin, miners begin to verify the transaction. All the transactions are grouped into boxes with a virtual lock; these boxes are the "blocks" that are chained together into the blockchain.

Miners run software to find the key that will open that virtual lock. If the key is found, the transactions are verified. The current number of attempts to find the correct key is 1,789,546,951.05, according to a top site for real-time bitcoin transactions. The miner gets a reward of newly generated bitcoins (currently 12.5 bitcoins) for finding the key. Every 210,000 blocks, or roughly every four years, the block reward is halved. It started at 50 bitcoins per block in 2009, and in 2012 it was halved to 25 bitcoins per block.
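The halving schedule above is simple arithmetic; a small Python sketch (the function name is illustrative):

```python
def block_reward(height, initial=50.0, interval=210_000):
    """Bitcoin block subsidy: halves once every 210,000 blocks."""
    return initial / (2 ** (height // interval))

# Reward at the genesis era, after the first halving, and after the second
rewards = [block_reward(h) for h in (0, 210_000, 420_000)]
```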

And as I said, bitcoins can be mined by anybody. To do so you just need powerful computation engines with top-quality hardware; with that you can pitch into the Bitcoin network to verify transactions by doing complex mathematical computation to find the right key for the block. When any one miner succeeds in solving the math problem, they get to create a new block and receive a certain number of bitcoins as a reward, known as "the block reward." If you don't want to invest much in purchasing a new powerful machine, you can join the network by adding your computer to a mining pool. Pools are collective groups of bitcoin miners who pool their computers to mine bitcoin. Sites such as Slush's Pool allow small miners to receive a portion of the bitcoins if they add their computers to the group.
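"Finding the key" is, concretely, a brute-force hash search: keep trying nonces until the block's hash meets a difficulty target. Here is a toy version in Python, with a difficulty far easier than the real network's (all names are illustrative):

```python
import hashlib

def mine(block_data, difficulty=3):
    """Toy proof-of-work: find a nonce whose SHA-256 digest starts with zeros."""
    prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{block_data}:{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce, digest  # the "key" that opens the lock
        nonce += 1

nonce, digest = mine("alice->bob:1BTC")
```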

However, in the early years of Bitcoin, mining with personal computers was possible. Now the network is very competitive, so using specialized hardware is the only way to earn.
Many online wallets are available on the internet where you don't have to maintain the Bitcoin software yourself, though you can also download the software and manage it locally on your computer or device.

From where to buy bitcoins: -
There are many online bitcoin exchanges where you can open a bitcoin wallet account and start doing transactions. Zebpay is one of the Android mobile bitcoin wallets. You can use my referral code REF30118675 to get a free bitcoin wallet and earn free bitcoin worth Rs. 100.


L1: Introduction to Apache Spark

January 07, 2018
Apache Spark is a scheduling, monitoring and distribution engine which does lightning-fast, fault-tolerant, in-memory parallel processing of data. It came out of the AMPLab project at UC Berkeley. Apache Spark was developed as a unified engine to meet all the needs of big data processing.

Spark Core uses both memory and disk while processing data. It traditionally had four language APIs (Scala, Java, Python and R, the last in an experimental phase), but now it also has a newer DataFrame API (introduced in 1.3). Around Spark Core there are high-level libraries like Spark SQL, GraphX, Spark Streaming, MLlib etc.

Spark has four modes of execution based on the resource manager and coordinator used for running Spark jobs. They are:-
  1. Local Mode
  2. Standalone Mode
  3. YARN Mode
  4. Mesos
Running Spark on YARN is very common in industry.
To know more about resource managers and execution modes in Spark, see L2: Resource Managers and different execution modes of Apache Spark.


L3: Python or Scala? Which one to choose for Apache Spark?

December 25, 2017

When I started learning Spark I was not sure which language to choose: Python or Scala? I was a PL/SQL developer, Python and Scala were both new languages for me, and I was not even aware of the market trend and requirements around Apache Spark. I started asking people around me and spent a considerable amount of time on Google trying to decide which language to choose for Apache Spark.
Finally I came to a conclusion and wanted to share it with everyone who is a beginner in Spark or is confused about which language to choose. My analysis is based on my own experience and on talking to people in the industry from India and the US. Without further ado, here are my few cents to help you decide "which language to choose for Apache Spark".

Popular language used in Industry for data analysis: -
The dominance of Python in areas like data science, machine learning and deep learning is unparalleled. Python is very popular among data scientists, and because of its tons of libraries it is really hard to beat Python in data analysis.

The hottest technology, TensorFlow, is written with Python in mind. Python is used in a broad range of scenarios, e.g. scientific and numeric computing, machine learning, software and business application development, data mining, cross-platform development, and RAD (Rapid Application Development).

So learning Python will broaden the scope of your career.

Performance of PySpark :-
In Spark 1.0.x we only had RDDs to work with, but with Spark 2.0.x we have the power of DataFrames. Using DataFrames, the runtime performance of a Spark job is the same in Python and Scala: Scala and Python DataFrame operations are compiled down to the same JVM execution plans, so there is a negligible performance difference. Python DataFrame operations are roughly five times faster than Python RDD operations.

However, in an actual project you may sometimes need to work with RDDs, but that can be easily handled.

Reluctance of ETL Developers for Scala: -
With the increasing popularity of Hadoop, and the trust it is building by delivering a powerful, reliable and cheaper data processing solution, most of the big industry players are now thinking of implementing new ETL projects on Hadoop or re-platforming existing ones. Data analysis projects have lots of ETL jobs to process and load data into data marts or a warehouse. Most of the ETL projects I came across had lots of Shell, Perl or other scripts for these jobs. In the last few years there has been a swing towards Python: it is not only the dominant alternative to Perl and shell scripts but also a powerful language in its own right.

People in the industry consider Python a better solution than Perl and shell scripts because of its ubiquity, power, rich community and gentle learning curve.

Ease of Learning and Productivity Graph: -
Python is both functional and object oriented, which makes it both easier and more robust. For a person from a PL/SQL background, Python will certainly be the natural choice. Python is easy to learn and has a gentle, steady learning curve compared to other programming languages whose learning curves are very steep.
Ease of development also comes from the presence of a wide Python community.

It's really easy!
All you need to get started with Python is to start coding in Python, plus a browser tab for Google searches.

Mastering Spark? Is this what you want? :-
Spark is written in Scala, so knowing Scala lets you understand and modify Spark's internal code. Since big data is still evolving, you will encounter many use cases with no direct solution available; to handle them you will either have to choose a tedious workaround or understand Spark's internals and modify them if required to fit your use case.
A good example of this scenario is reading a csv in Spark through read.csv() and capturing all the bad records along with the error message, record number and bad column value. In Spark 2.0 there is no straightforward way to do this (a solution to this scenario is explained in another post).

If you come across any bug in Spark's code, you can fix it only if you know Scala, e.g. the DataFrameWriter.saveAsTable issue with Hive format when creating partitioned tables.

So if you want to master Spark, you will have to know Scala.

Conclusion: -
  1. If you are a beginner and you don't have a specific requirement to learn a particular language, then go for Python. Python is easy and has a gentle learning curve, and so will your Spark learning be. You will be a good Spark developer in very little time with Python. Once that is done, you will be in a good position to decide whether to move to Scala or whether you are happy with your career in PySpark.
  2. If you know Python, companies working on data science (with Spark) or biotech software will certainly prefer you.
  3. I see a growing trend of migrating ETL projects from other languages (Perl, shell) to Python, so it is good to choose Python at this point.
  4. It's really easy, no extra effort required. All you need is to start coding in Python, plus a browser tab for Google searches.

Java: Singleton Design Pattern

November 24, 2017

The Singleton design pattern belongs to the creational design pattern family. This pattern controls object creation: only one object is created per Java Virtual Machine (JVM), and this object can be used by all classes.
There are many ways to implement this pattern.

1) Eager Initialization: In this method, the object of the class is created when the class is loaded into memory by the JVM. This is done by assigning the instance to the reference variable directly.

 // Java code to create singleton class by
 // Eager Initialization
 public class SingletonTest {
   // public instance initialized when loading the class
   public static SingletonTest obj = new SingletonTest();

   private SingletonTest() {
     // code for private constructor
   }
 }

2) Using a Static Block: this is the same as eager initialization; the only difference is that the object is created in a static block, which lets you handle any exception that may occur.

 // Java code to create singleton class
 // Using Static block
 public class SingletonTest {
   // public instance
   public static SingletonTest obj;

   // static block to initialize instance and handle any exception
   static {
     try {
       obj = new SingletonTest();
     } catch (Exception e) {
       throw new RuntimeException(e);
     }
   }

   private SingletonTest() {
     // private constructor
   }
 }


Java Memory Management

November 11, 2017

Java provides an excellent feature called garbage collection, which allows developers to create objects without any worries.

Java takes care of memory allocation and de-allocation itself (in C/C++ the developer has to manage object memory allocation and de-allocation).
The important work of garbage collection is to free unwanted object space.

As you can see in the image below, JVM memory is divided into different parts. At a high level it is divided into two major parts:
  1. Young generation
  2. Old generation

Young generation - The young generation is the area where all new objects are created. When this area fills up, garbage collection is performed; this garbage collection is called Minor GC. The young generation is further divided into different parts:
  • Eden memory
  • Survivor memory spaces (S1 and S0, as shown in the image above)
How does garbage collection work in Eden memory?
Most newly created objects are placed in the Eden memory space. When Eden fills up, a Minor GC is performed and all surviving objects are moved to one of the survivor spaces; at the same time, the Minor GC also checks the objects in the other survivor space and moves its survivors across as well.
Objects which are still live after many cycles of GC are moved to the old generation memory space.

How does garbage collection work in old generation memory?
The old generation contains objects which are long-lived and have survived many cycles of Minor GC. When the old generation memory space is full of objects, garbage collection is performed; this is called Major GC. Major GC has one drawback: while it runs, all application threads are stopped until the operation completes.
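The sizes of these regions are tunable through JVM flags. The values below are illustrative examples for a hypothetical application, not recommendations:

```shell
# -Xms/-Xmx           : initial and maximum heap size
# -Xmn                : young generation size
# -XX:SurvivorRatio   : ratio of Eden to one survivor space
# -XX:+PrintGCDetails : log minor and major GC activity
java -Xms512m -Xmx2g -Xmn256m -XX:SurvivorRatio=8 -XX:+PrintGCDetails MyApp
```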

Permanent generation
The permanent generation, or 'Perm Gen', contains the application metadata required by the JVM to describe the methods and classes used in the application.
Perm Gen is populated by the JVM at run time.
