Most companies are exploring ways to streamline the information available to them, and Big Data is trending in a big way these days. So what exactly is Big Data? Sreelata tells us about it, exclusively in Different Truths.
Today, businesses are swamped day to day with enormous amounts of data. The data sets are so huge and challenging that the usual data-processing applications are unable to handle them. Commonly used software tools that search, capture, analyse, curate, manage, transfer, update and process data within a required time frame have become ineffective because of the sheer volume involved. This outsized mass of data, which includes both structured and unstructured information, is generally sourced from three categories: 1) streaming data (IT/online), 2) social media data, and 3) publicly available or open-source data (government). This is what describes ‘Big Data’.
It was soon realised that if Big Data was to be made relevant to businesses, it required new techniques, technologies and analytical methods to extract value from it. For it to be of any use or importance, it needed to be ‘analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions’. Big Data, therefore, came to be defined not by the size of the data but by how it could be utilised.
The term was first coined by John Mashey, chief scientist at Silicon Graphics, in the 1990s. The characterisation of Big Data was later formulated into the ‘3 Vs’ model by Doug Laney, data analyst at Gartner, which is what most of the industry presently follows.
Big Data comprises:
Volume: Large amounts of data are generated all the time – from everywhere – business transactions, digital processes, social media.
Handling this volume requires a new class of technologies – storage platforms like Hadoop and related tools like YARN, MapReduce, Spark, Hive and Pig, and NoSQL databases.
Velocity: Data arrives at unprecedented speed from sources such as RFID tags, sensors, and smart meters, and must often be dealt with in near real time.
Variety: It comes in multiple forms – email, video, numerical data, online and offline records, customer data, competitive intelligence, financial transactions, and so on.
It is also variable and inconsistent, and extremely complex, so its veracity has to be confirmed.
To harness the required processing power, analytics capabilities and new skills, businesses are investing in courses, training a new generation of Big Data experts and creating new business models.
According to IBM, Big Data is changing the way people within organisations work together. It is creating a culture in which business and IT leaders must join forces to realise value from all data. From data-driven marketing and ad targeting to the connected car, Big Data is fuelling product innovation and new revenue opportunities for many organisations. Combining Big Data inputs with high-powered analytics allows employees to make better decisions, cut costs, optimise processes and profits, and prevent fraud and other threats. It can be used to collate, analyse, understand and predict results.
For example, Big Data can be used in sports to improve training and understand competitors through sports sensors. Big Data analytics could also be used to predict the winner of a match or the future performance of players. Based on data collected throughout a season, a player’s value and salary could be decided.
In Formula One races, cars fitted with hundreds of sensors generate terabytes of data, covering data points from tyre pressure to fuel-burn efficiency. This data is transferred to team headquarters in the United Kingdom over high-speed fibre-optic cables. Based on the data, engineers and data analysts decide whether adjustments should be made in order to win a race. Race teams also use Big Data to predict their finishing time in advance, based on simulations using data collected over the season.
Similarly, it can be used in every field – banking, healthcare, education, retail, manufacturing and government.
Big Data analysis also played a major role in helping the BJP win the Indian General Election of 2014.
Storage Platforms and Related Tools
Hadoop: Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs. Hadoop clusters are known for boosting the speed of data analysis applications (www.sas.com/en_us/insights/big-data/hadoop.html )
Prominent users of Hadoop clusters include Yahoo, IBM and Facebook, which has run one of the largest Hadoop clusters in the world.
YARN, MapReduce, Spark, Hive and Pig, and NoSQL databases are all key tools that considerably extend what Hadoop can do.
YARN (Yet Another Resource Negotiator) is a cluster management technology and a key feature of Hadoop 2.
MapReduce is composed of several components, including:
JobTracker – the master node that manages all jobs and resources in a cluster
TaskTrackers – agents deployed to each machine in the cluster to run the map and reduce tasks
JobHistoryServer – a component that tracks completed jobs, and is typically deployed as a separate function or with JobTracker
To distribute input data and collate results, MapReduce operates in parallel across massive cluster sizes. Because cluster size doesn’t affect a processing job’s final results, jobs can be split across almost any number of servers. Therefore, MapReduce and the overall Hadoop framework simplify software development.
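The map–shuffle–reduce flow described above can be sketched on a single machine. This is a toy, pure-Python word count, not Hadoop itself – real MapReduce distributes each phase across a cluster, but the shape of the computation is the same:

```python
from collections import defaultdict

def map_phase(document):
    """The 'map' step: emit a (word, 1) pair for every word in the input."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """The 'reduce' step: sum the counts collected for each word."""
    return {word: sum(values) for word, values in groups.items()}

documents = ["big data is big", "data about data"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle_phase(pairs))
print(counts)  # {'big': 2, 'data': 3, 'is': 1, 'about': 2}... run it to see
```

Because each document can be mapped independently and each word reduced independently, the same code parallelises naturally across many servers – which is exactly why cluster size does not affect the final result.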
Apache Spark is an open-source parallel processing framework for running large-scale data analytics applications across clustered computers. It can process data from a variety of data repositories, including the Hadoop Distributed File System (HDFS), NoSQL databases and relational data stores such as Apache Hive. Spark supports in-memory processing to boost the performance of big data analytics applications, but it can also do conventional disk-based processing when data sets are too large to fit into the available system memory.
NoSQL (‘Not Only SQL’) is an approach to data management and database design that’s useful for very large sets of distributed data. It is especially useful when an enterprise needs to access and analyse massive amounts of unstructured data, or data stored remotely on multiple virtual servers in the cloud.
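The schema-less ‘document’ model many NoSQL databases favour can be illustrated with a toy in-memory store. The `put`/`get` functions and the records below are hypothetical, invented for illustration – not the API of any real database:

```python
import json

# A toy key-value document store: each record is a self-describing
# JSON document, so two records in the same collection need not
# share the same fields (unlike rows in a relational table).
store = {}

def put(collection, key, document):
    """Serialise a document and file it under (collection, key)."""
    store.setdefault(collection, {})[key] = json.dumps(document)

def get(collection, key):
    """Fetch and deserialise a stored document."""
    return json.loads(store[collection][key])

# Two 'users' documents with different shapes - no fixed schema required.
put("users", "u1", {"name": "Asha", "email": "asha@example.com"})
put("users", "u2", {"name": "Ravi", "followers": 1200, "tags": ["sports"]})

print(get("users", "u2")["followers"])  # 1200
```

This flexibility is what makes the model suit large volumes of varied, unstructured data: new fields can appear in new records without restructuring everything already stored.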
The Hive Hadoop component is used mainly by data analysts, whereas the Pig Hadoop component is generally used by researchers and programmers. Hive is used for completely structured data, whereas Pig is used for semi-structured data.
Photos from the internet.
#ScienceAndTechnology #BigData #DataStorage #InformationTechnology #DifferentTruths