In this tutorial, we will discuss big data, its dimensions and layers, and finally big data testing and its challenges. Let’s start by understanding what big data is.
What is Big Data?
The term big data refers to large sets of structured, semi-structured, and unstructured data that organizations collect from different sources. The data is so voluminous and complex that traditional data processing software cannot store and process it.
Structured data sets contain data in a fixed, predefined format e.g. tables in an RDBMS.
Unstructured data sets have no predefined format e.g. text, images, audio, or video.
Semi-structured data sets do not follow a rigid schema but still carry some organizing markers such as tags or key-value pairs e.g. JSON, XML, or log files.
All these data sets can be analyzed and used in various decision-making processes by the organizations.
Big Data Dimensions
The big data concept focuses on 3Vs or 3 dimensions as discussed below; the testing team needs to focus on each of these dimensions while testing.
- Volume – The ‘Volume’ dimension represents the large amount of data that the organization collects from different sources and that keeps growing day by day. These data sets have to be validated for quality i.e. correctness, but performing this validation manually is cumbersome and time-consuming.
So testing teams now write scripts and execute them in parallel to compare two sets of data after converting them into the desired format.
- Velocity – This dimension represents the speed at which data is generated or gathered. Data sets need to be analyzed in real time as soon as they are available. This calls for performance testing, which checks how well the system handles high-velocity data.
- Variety – As the data sets are collected from different sources, each data set can have a different format: structured, unstructured, or semi-structured. Traditional methods built for structured data, such as relational databases, are poorly suited to managing unstructured data.
Some people add two more Vs, making it 5Vs or 5 dimensions.
- Veracity – As the data is collected from different sources, it may be unverified and repetitive. Such data cannot be used until it is cleaned and verified as authentic.
- Value – This dimension represents the ability to derive value by analyzing the collected data.
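The script-based comparison mentioned under the Volume dimension could look roughly like the sketch below. It is only an illustration under stated assumptions: the function names (`normalize`, `partition_digest`, `compare_partitions`), the partition names, and the sample records are all hypothetical, and the "desired format" is assumed to be trimmed, lower-cased string fields. Each partition is reduced to an order-independent digest, and the partitions are compared in parallel using a thread pool.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def normalize(record):
    """Convert a raw record to the assumed common format: trimmed, lower-cased strings."""
    return tuple(str(field).strip().lower() for field in record)


def partition_digest(records):
    """Return a digest of one partition that does not depend on record order."""
    hashes = sorted(
        hashlib.sha256(repr(normalize(r)).encode("utf-8")).hexdigest()
        for r in records
    )
    return hashlib.sha256("".join(hashes).encode("utf-8")).hexdigest()


def compare_partitions(source_parts, target_parts, max_workers=4):
    """Compare same-named partitions in parallel; return the names that differ."""
    names = sorted(set(source_parts) | set(target_parts))

    def matches(name):
        source_digest = partition_digest(source_parts.get(name, []))
        target_digest = partition_digest(target_parts.get(name, []))
        return source_digest == target_digest

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(matches, names))
    return [name for name, ok in zip(names, results) if not ok]


# Hypothetical sample data: two monthly partitions from a source and a target system.
source = {"2024-01": [(" Alice ", 10), ("Bob", 20)], "2024-02": [("Carol", 30)]}
target = {"2024-01": [("bob", 20), ("alice", 10)], "2024-02": [("Carol", 31)]}
mismatched = compare_partitions(source, target)
print(mismatched)
```

Here the first partition matches despite differences in order, case, and whitespace, while the second is reported because one value differs. A thread pool keeps the sketch simple; for heavy CPU-bound comparisons a process pool or a distributed job would likely be a better fit.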
Layers of Big Data
A big data platform has four layers, and the data must pass through each of them. The four layers are as follows:
- Data Source Layer – In this first stage, data is collected from various sources such as e-mails, existing databases, transactions, social networking sites, etc. Before gathering the data, the organization should assess whether it needs this data and whether the sources are sufficient to provide it.
- Data Storage Layer – Once the data is collected, it is stored in this layer. To store a large set of data, a proper system needs to be used that is secure from threats and easily manageable. One such system is HDFS i.e. Hadoop Distributed File System.
- Data Processing or Analysis Layer – In this layer, the data collected and stored in the above layers is analyzed, and useful insights and trends are identified. Hadoop uses the MapReduce framework to process the data stored in HDFS.
- Data Output Layer – Once data has been analyzed, it will be communicated or displayed to the users in the form of charts, graphs, or any other form.
Big Data Testing
As we discussed in the 5Vs of big data, handling such data is challenging. To overcome these challenges, the open-source framework ‘Hadoop’ was developed. Hadoop stores and processes large sets of data and uses its own file system, HDFS (Hadoop Distributed File System). From this point onwards, this tutorial explains big data testing with a focus on the Hadoop framework.
Big data testing should cover both functional and non-functional testing, as explained below-
Functional testing is divided into the following three stages-
1. Testing of Loading Data into HDFS (Pre-Hadoop Process) – Big data systems have structured, unstructured and semi-structured data i.e. data is in different formats and it is collected from different sources. This data is stored in HDFS.
The checklist for this stage is as follows-
- Check whether the data collected from the source is uncorrupted and accurate.
- Check whether the data is complete and does not contain any duplicates.
- Check whether the data files are stored in the correct location.
- Compare the source data with the data in HDFS to make sure they match.
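Several items in this checklist can be automated. The following is a minimal sketch, assuming the source records and the records loaded into HDFS have already been read into in-memory lists of tuples; how they are read depends on the environment and is not shown. The function name `validate_load`, the `key_index` parameter, and the sample data are hypothetical.

```python
from collections import Counter


def validate_load(source_records, loaded_records, key_index=0):
    """Compare source records with their loaded copy and report the findings."""
    source_keys = Counter(r[key_index] for r in source_records)
    loaded_keys = Counter(r[key_index] for r in loaded_records)
    return {
        # Completeness: same number of records on both sides.
        "count_matches": len(source_records) == len(loaded_records),
        # Duplicates: keys that appear more than once after loading.
        "duplicate_keys": sorted(k for k, n in loaded_keys.items() if n > 1),
        # Keys present in the source but absent after loading.
        "missing_keys": sorted(set(source_keys) - set(loaded_keys)),
        # Keys that appeared after loading but are not in the source.
        "unexpected_keys": sorted(set(loaded_keys) - set(source_keys)),
        # Accuracy: identical records with identical multiplicity.
        "content_matches": Counter(source_records) == Counter(loaded_records),
    }


# Hypothetical sample: record 3 was lost and record 2 was loaded twice.
source_records = [(1, "a"), (2, "b"), (3, "c")]
loaded_records = [(1, "a"), (2, "b"), (2, "b")]
report = validate_load(source_records, loaded_records)
print(report)
```

Note that the record counts match in this sample even though the load is wrong, which is why a count check alone is not enough and the key and content checks are needed as well.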
2. MapReduce Operations – Once the data is stored in HDFS, it needs to be analyzed; for this purpose, MapReduce is used. MapReduce helps in processing the data in parallel on multiple nodes. It performs two important tasks i.e. Map and Reduce.
A MapReduce job is split into multiple Map and Reduce tasks, and these tasks are assigned to nodes in the cluster. The input and the final output of the job are stored in HDFS.
In the Map task, an individual set of data is processed by the mapper, and multiple smaller chunks of data are produced in the form of (key, value) pairs.
The Reduce task includes two steps, Shuffle and Reduce. The shuffle step groups the mapper output by key, and the reducer then processes the grouped data and generates new aggregated set(s) of output in the form of (key, value) pairs. This process is a common topic in Hadoop and big data interview questions.
The checklist for this stage is as follows-
- Check the business logic on the nodes.
- Check whether the MapReduce process is generating correct (key, value) pairs.
- Check the aggregated data after the ‘Reduce’ task.
- Compare the output data generated after the ‘Reduce’ task against the input files (input of the ‘Map’ task).
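To make the Map, Shuffle, and Reduce steps and the checks above concrete, here is a small word-count sketch in plain Python. It does not use Hadoop at all; it only imitates the three steps in memory, and the function names and input lines are hypothetical. The assertions at the end mirror the checklist: the (key, value) pairs have the expected types, the aggregated output matches an independently computed expectation, and the output totals reconcile with the input.

```python
from collections import Counter, defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all values by key before they reach the reducer."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reducer: aggregate the values of each key into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def run_job(lines):
    """Run the three steps in sequence over all input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle_phase(pairs))


lines = ["big data testing", "Big Data tools", "testing tools"]
output = run_job(lines)

# Expected result computed independently of the map/shuffle/reduce code.
expected = Counter(word.lower() for line in lines for word in line.split())

# Check that the job produces well-formed (key, value) pairs.
assert all(isinstance(k, str) and isinstance(v, int) for k, v in output.items())
# Check the aggregated data after the Reduce step.
assert output == dict(expected)
# Reconcile the output with the input: every input word is counted exactly once.
assert sum(output.values()) == sum(len(line.split()) for line in lines)
```

On a real cluster the same idea applies, with the expected values typically computed on a small, known input sample rather than on the full data set.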
3. ETL Process Validation and Report Testing (Testing Results from HDFS) – In this stage, the data generated in the second stage is stored in the data warehouse i.e. Enterprise Data Warehouse (EDW). In the EDW, the data is either analyzed to gain more insights or used to generate reports.
The checklist for this stage is as follows-
- Compare the data stored in the EDW with the data in HDFS to make sure no data corruption has happened.
- Check whether the correct transformation rules have been applied.
- Check the reports generated by the system to make sure they are correct, include desired data, and have a proper layout.
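The first two checks can be sketched as follows. This is an illustration only: the transformation rule in `transform` (trim and title-case the name, convert cents to a currency amount), the field names, and the sample rows are all assumptions made up for the example, and both data sets are assumed to be available as lists of dictionaries.

```python
def transform(record):
    """Hypothetical transformation rule applied between HDFS and the warehouse."""
    return {
        "id": record["id"],
        "name": record["name"].strip().title(),
        "amount": round(record["amount_cents"] / 100, 2),
    }


def validate_warehouse(hdfs_rows, edw_rows):
    """Return the ids whose warehouse row is missing or differs from the expected result."""
    edw_by_id = {row["id"]: row for row in edw_rows}
    mismatches = []
    for record in hdfs_rows:
        if edw_by_id.get(record["id"]) != transform(record):
            mismatches.append(record["id"])
    return mismatches


# Hypothetical sample: the amount of row 2 was corrupted on its way to the warehouse.
hdfs_rows = [
    {"id": 1, "name": " alice ", "amount_cents": 1250},
    {"id": 2, "name": "bob", "amount_cents": 990},
]
edw_rows = [
    {"id": 1, "name": "Alice", "amount": 12.5},
    {"id": 2, "name": "Bob", "amount": 99.0},
]
bad_ids = validate_warehouse(hdfs_rows, edw_rows)
print(bad_ids)
```

Re-implementing the transformation rule in the test, independently of the ETL code, is a deliberate choice here: it lets the test catch both corrupted data and a wrongly applied rule.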
Non-functional testing is performed in the following ways-
1. Performance Testing – As big data systems process a large amount of data in a short period, the system’s performance must be tested by measuring metrics such as completion time, data throughput, memory utilization, and data storage. The checklist for this testing is as follows-
- Derive various performance metrics such as the speed of data consumption, maximum processing capacity, and response time.
- Check the data processing speed of the MapReduce tasks.
- Check data storage at different nodes.
- Check conditions that can result in performance problems.
- Identify configurations that can help in the optimization of the performance.
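As a rough illustration of how completion time and throughput can be derived, the sketch below times a stand-in processing function. The names `measure` and `count_even` and the generated input are hypothetical, and a real performance test would run against the cluster with production-like data volumes rather than an in-memory list.

```python
import time


def measure(job, records):
    """Run `job` over `records` and return its result, completion time, and throughput."""
    start = time.perf_counter()
    result = job(records)
    elapsed = time.perf_counter() - start
    return {
        "result": result,
        "completion_seconds": elapsed,
        # Guard against a zero reading on very coarse clocks.
        "records_per_second": len(records) / elapsed if elapsed > 0 else float("inf"),
    }


def count_even(records):
    """Stand-in for a real processing task such as a MapReduce job."""
    return sum(1 for r in records if r % 2 == 0)


metrics = measure(count_even, list(range(100_000)))
print(f"completed in {metrics['completion_seconds']:.4f} s, "
      f"{metrics['records_per_second']:.0f} records/s")
```

Running the same measurement with different input sizes or configurations gives a simple way to spot the conditions under which performance degrades.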
2. Failover Testing – The Hadoop framework runs on multiple nodes, and some of these are bound to fail. It is therefore important to detect such node failures and recover from them.
Failover testing makes sure that the system can recover after a failure and continue processing data after switching to other data nodes. ‘Recovery Time Objective’ (RTO) and ‘Recovery Point Objective’ (RPO) are the two metrics evaluated during this testing process.
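The two metrics can be checked with simple timestamp arithmetic once a failover test has been run. In the sketch below, the Recovery Time Objective is taken as the maximum acceptable time to restore service and the Recovery Point Objective as the maximum acceptable window of lost data. The function names, the timestamps, and the objective values are hypothetical.

```python
from datetime import datetime, timedelta


def recovery_metrics(last_good_write, failure_at, service_restored_at):
    """Return the measured recovery time and data-loss window for one failover test."""
    return {
        # Time between the failure and the moment processing resumed.
        "recovery_time": service_restored_at - failure_at,
        # Data written after the last safely stored write may be lost.
        "data_loss_window": failure_at - last_good_write,
    }


def meets_objectives(metrics, rto, rpo):
    """Check the measured values against the agreed RTO and RPO."""
    return metrics["recovery_time"] <= rto and metrics["data_loss_window"] <= rpo


# Hypothetical timestamps recorded during a failover test.
measured = recovery_metrics(
    last_good_write=datetime(2024, 1, 1, 12, 0, 0),
    failure_at=datetime(2024, 1, 1, 12, 0, 30),
    service_restored_at=datetime(2024, 1, 1, 12, 2, 0),
)
within_targets = meets_objectives(
    measured, rto=timedelta(minutes=2), rpo=timedelta(minutes=1)
)
print(measured["recovery_time"], measured["data_loss_window"], within_targets)
```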
Challenges in Big Data Testing
The challenges faced during big data testing are among the most frequently asked big data interview questions. Let’s quickly go through them-
- As we discussed earlier, big data testing is a process that deals with a high volume of data. Testing such a huge amount of data can be complicated.
- Also, the data used in big data systems is fetched from various sources and is often unstructured. Dealing with such data can become a challenge.
- For such testing, highly skilled and experienced resources are required. Testers need to learn and master the frameworks and big data testing tools used in the process.
- Testers need to continuously study and understand the data on which processing is happening. It includes understanding the business rules, the relationship between the data sets, and the value they will be providing to the users.
- As the various stages of the process use different technologies, there isn’t just one tool that can be used for the whole testing process.
Today, almost all major organizations have a huge volume of raw data that is collected from different sources, mainly in an unstructured form. Big data technologies help in storing and analyzing this data and then deriving insights and reports from it.
This helps organizations make productive business decisions that will eventually help them generate revenue and build a large customer base. As this process holds much importance for the organization, it should be tested thoroughly so that the organization can obtain as many benefits from it as possible.
Kuldeep is the founder and lead author of ArtOfTesting. He is skilled in test automation, performance testing, big data, and CI-CD. He brings a decade of experience to his current role, where he is dedicated to educating QA professionals. You can find him on LinkedIn.