Big Data: methods and techniques of big data analysis
As of mid-2018, Wikipedia gave the following definition of Big Data:
“Big data is a designation for structured and unstructured data of enormous volume and significant variety, effectively processed by horizontally scalable software tools that emerged in the late 2000s as an alternative to traditional database management systems and Business Intelligence solutions.”
As you can see, this definition relies on vague terms such as “enormous”, “significant”, “effectively” and “alternative”. Even the name itself is quite subjective: are 4 terabytes (the capacity of a modern external laptop hard drive) big data or not? To this definition Wikipedia adds the following: “In the broad sense, ‘big data’ refers to a socio-economic phenomenon associated with the emergence of technological capabilities to analyze huge amounts of data (in some areas, the entire world's volume of data) and the transformational consequences that follow.”
Analysts at IBS estimated the “total volume of the world's data” as follows:
2003 — 5 exabytes of data (1 EB = 1 billion gigabytes)
2008 — 0.18 zettabytes (1 ZB = 1024 exabytes)
2015 — more than 6.5 zettabytes
2020 — 40-44 zettabytes (forecast)
2025 — this volume is forecast to grow another 10 times.
The report also notes that most data is generated not by ordinary consumers but by enterprises1 (recall the Industrial Internet of Things).
A simpler definition can be used, one consistent with the view established among journalists and marketers.
“Big data is a collection of technologies designed to perform three operations:
- to handle volumes of data that are large compared to “standard” scenarios;
- to cope with rapidly arriving data in very large volumes (that is, there is not just a lot of data, but ever more of it);
- to work with structured and weakly structured data in parallel and in different aspects.”2
It is believed that these “skills” make it possible to reveal hidden patterns that escape limited human perception. This opens unprecedented opportunities to optimize many areas of our life: public administration, medicine, telecommunications, finance, transport, manufacturing and so on. It is not surprising that journalists and marketers have used the phrase Big Data so often that many experts consider the term discredited and propose abandoning it.3
Moreover, in October 2015 Gartner excluded Big Data from its list of popular trends. Its analysts explained the decision by the fact that the concept of “big data” covers a large number of technologies that are already in active use at enterprises, partially belong to other popular areas and trends, and have become everyday working tools.4
In any case, the term Big Data is still widely used, as this article itself demonstrates.
Three “V”s (4, 5, 7) and the three principles of big data
The defining characteristics of big data include, beyond sheer physical volume, other attributes that emphasize how complex the task of processing and analyzing the data is. The set of attributes VVV (volume, velocity, variety: physical volume, the rate of data growth and the need for fast processing, and the ability to process data of different types simultaneously) was formulated by Meta Group in 2001 to point out the equal importance of data management in all three aspects.
Later there appeared interpretations with four Vs (adding veracity), five Vs (viability and value), and seven Vs (variability and visualization). IDC, for example, interprets the fourth V as value, emphasizing the economic feasibility of processing large volumes of data under appropriate conditions.5
Based on the above definitions, the basic principles of working with big data are:
- Horizontal scalability. This is the basic principle of big data processing. As already mentioned, the volume of big data grows every day. Accordingly, the number of compute nodes across which the data is distributed must be increased, and processing must continue without performance degradation.
- Fault tolerance. This principle follows from the previous one. Since a cluster can contain many compute nodes (sometimes tens of thousands) and their number is likely to grow, the probability of machine failure grows as well. Methods of working with big data must take such situations into account and provide preventive measures.
- Data locality. When data is spread across a large number of compute nodes, transferring data that is physically stored on one server to be processed on another can become unreasonably expensive. Therefore, data should preferably be processed on the same machine on which it is stored.
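As an illustration of the horizontal-scalability principle, the sketch below (illustrative names, not any real framework's API) shows hash partitioning: each record key is deterministically mapped to one of N compute nodes, so adding capacity only means changing N and rebalancing.

```python
# A minimal sketch of hash partitioning, the mechanism behind
# horizontal scalability. Function and record names are invented.
import hashlib

def node_for(key: str, num_nodes: int) -> int:
    """Map a record key to a compute node using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

records = ["sensor-1", "sensor-2", "sensor-3", "sensor-4"]
placement = {r: node_for(r, num_nodes=4) for r in records}
print(placement)  # each record lands on exactly one of nodes 0..3
```

In real systems consistent hashing is usually preferred, since a plain modulo reshuffles almost every key whenever the number of nodes changes.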
These principles differ from those typical of traditional, centralized, vertical storage models for well-structured data. Accordingly, new approaches and technologies are being developed for working with big data.
Technologies and trends in working with Big Data
Initially, the set of approaches and technologies included tools for massively parallel processing of vaguely structured data, such as NoSQL DBMSs, MapReduce and the Hadoop toolkit. Later, other solutions that provide similar capabilities for processing ultra-large data sets, as well as certain hardware, came to be counted among big data technologies.
- MapReduce is a model of distributed parallel computing in computer clusters, introduced by Google. In this model, an application is split into a large number of identical elementary jobs that execute on cluster nodes and are then naturally reduced into the final result.
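The word count below is a minimal single-process sketch of the MapReduce idea, not Google's implementation: map emits (key, 1) pairs, a shuffle step groups them by key, and reduce sums each group.

```python
# Toy MapReduce word count: map -> shuffle -> reduce, all in one process.
from collections import defaultdict

def map_phase(document: str):
    # map: emit a (word, 1) pair for every word in the document
    for word in document.split():
        yield word.lower(), 1

def shuffle(pairs):
    # shuffle: group all emitted values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: sum the values of each group
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big clusters", "data locality"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 2, 'data': 2, 'clusters': 1, 'locality': 1}
```

In a real cluster, the map and reduce phases run on many nodes in parallel and the shuffle moves data across the network; the logic, however, is the same.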
- NoSQL (from the English “Not Only SQL”) is a general term for various non-relational databases and stores; it does not denote any particular technology or product. Conventional relational databases work well for relatively fast, uniform queries, but under the complex, flexibly constructed queries typical of big data the load exceeds reasonable limits and using such a database becomes inefficient.
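As a toy illustration of the key-value flavor of NoSQL (the class and method names are invented, not any real product's API), the store below keeps schema-less documents looked up by key, with no fixed columns or joins:

```python
# A toy key-value store: values are schema-less documents looked up
# by key. Illustrative only; real NoSQL stores add persistence,
# replication and partitioning on top of this idea.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, document):
        self._data[key] = document  # documents need not share a schema

    def get(self, key, default=None):
        return self._data.get(key, default)

store = KeyValueStore()
store.put("user:1", {"name": "Ann", "tags": ["admin"]})
store.put("event:42", {"ts": 1609459200, "level": "warn"})  # different shape
print(store.get("user:1"))  # {'name': 'Ann', 'tags': ['admin']}
```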
- Hadoop is a freely distributed set of utilities, libraries and a framework for developing and executing distributed programs that run on clusters of hundreds or thousands of nodes. It is considered one of the foundational big data technologies.
- R is a programming language for statistical data processing and graphics. It is widely used for data analysis and has become a de facto standard for statistical software.
- Hardware solutions. Teradata, EMC and other corporations offer hardware-software appliances designed for big data processing. These appliances ship as ready-to-install telecommunications cabinets containing a cluster of servers and management software for massively parallel processing. The in-memory analytical appliances SAP HANA and Oracle Exalytics are also sometimes counted here, even though such processing is not inherently massively parallel and the memory of a single node is limited to a few terabytes.6
McKinsey, in addition to the NoSQL, MapReduce, Hadoop and R technologies considered by most analysts, also includes Business Intelligence technologies and relational database management systems with SQL support in the context of big data applicability.
Methods and techniques of big data analysis
The international consulting company McKinsey, which specializes in problems of strategic management, identifies 11 methods and techniques of analysis applicable to big data.
• Data Mining methods: a set of methods for discovering previously unknown, non-trivial, practically useful knowledge in data, needed for decision-making. They include association rule learning, classification (division into categories), cluster analysis, regression analysis, detection and analysis of deviations, etc.
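One of the listed Data Mining methods, cluster analysis, can be sketched with a tiny pure-Python k-means on one-dimensional points (a didactic toy with naive initialization, not a production algorithm):

```python
# Toy 1-D k-means: alternate between assigning points to the nearest
# center and recomputing each center as its cluster's mean.
def kmeans_1d(points, k, iters=20):
    centers = sorted(points)[:k]  # naive initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assignment step: nearest center wins
            i = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[i].append(p)
        # update step: move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]
centers, clusters = kmeans_1d(points, k=2)
print(sorted(round(c, 2) for c in centers))  # [1.0, 10.0]
```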
• Crowdsourcing: classification and enrichment of data by a broad, indefinite circle of people who perform the work without entering into an employment relationship.
• Data fusion and integration: a set of techniques for integrating heterogeneous data from a variety of sources to enable in-depth analysis (for example, digital signal processing and natural language processing, including sentiment analysis).
• Machine learning, including supervised and unsupervised learning: the use of models built on the basis of statistical analysis or machine learning to produce comprehensive forecasts from base models.
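A minimal supervised-learning example: fitting a one-variable least-squares regression line from labeled examples and then predicting for a new input (the numbers are invented for illustration):

```python
# Closed-form least-squares fit of y = slope * x + intercept.
def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1.0, 2.0, 3.0, 4.0]  # feature (e.g. machine load)
ys = [2.1, 4.0, 6.1, 8.0]  # label (e.g. power draw), roughly y = 2x
slope, intercept = fit_line(xs, ys)
predict = lambda x: slope * x + intercept
print(round(predict(5.0), 2))  # close to 10.0
```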
• Artificial neural networks, network analysis, optimization, including genetic algorithms (heuristic search algorithms used for optimization and modelling problems that randomly select, combine and vary the desired parameters using mechanisms similar to natural selection in nature).
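The genetic-algorithm idea just described (selection, combination, random variation) can be sketched as a toy loop that maximizes a simple function; every parameter here is illustrative:

```python
# Toy genetic algorithm maximizing f(x) = -(x - 3)^2, whose optimum is x = 3.
import random

random.seed(0)

def fitness(x):
    return -(x - 3.0) ** 2

def evolve(generations=100, pop_size=20):
    population = [random.uniform(-10, 10) for _ in range(pop_size)]
    for _ in range(generations):
        # selection: keep the fitter half of the population
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        # crossover (averaging two parents) plus a small random mutation
        children = [
            (random.choice(parents) + random.choice(parents)) / 2
            + random.gauss(0, 0.1)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness)

best = evolve()
print(round(best, 1))  # close to the true optimum x = 3
```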
• Pattern recognition
• Predictive Analytics
• Simulation: a method for building models that describe processes as they would actually unfold. Simulation can be regarded as a kind of experimental test.
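A minimal simulation sketch in the Monte Carlo style: estimating the expected number of failure days per year for a machine with a fixed daily failure probability (the numbers are invented for illustration):

```python
# Monte Carlo estimate of failure days per year; expected value is
# 365 * 0.01 = 3.65 for the parameters below.
import random

random.seed(42)

def simulate_days(days=365, p_failure=0.01, trials=2000):
    """Average number of failure days per year over many simulated years."""
    total = 0
    for _ in range(trials):
        total += sum(1 for _ in range(days) if random.random() < p_failure)
    return total / trials

avg_failures = simulate_days()
print(round(avg_failures, 2))  # close to 3.65
```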
• Spatial analysis: a class of methods that use topological, geometric and geographic information extracted from the data.
• Statistical analysis: time series analysis, A/B testing (split testing, a marketing research method in which a control group of elements is compared with a set of test groups where one or several indicators have been changed, in order to find out which changes improve the target metric).
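The arithmetic behind a basic A/B test can be sketched as a two-proportion z-test comparing the conversion rates of a control and a variant group (sample figures below are invented):

```python
# Two-proportion z-test for an A/B experiment, standard library only.
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

def p_value_two_sided(z):
    # two-sided tail of the standard normal via the complementary error function
    return math.erfc(abs(z) / math.sqrt(2))

# control: 200/10000 conversions; variant: 260/10000
z = two_proportion_z(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
print(round(z, 2), round(p_value_two_sided(z), 4))
```

Here the p-value falls below 0.05, so under the usual convention the variant's improvement would be judged statistically significant.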
• Visualization of analytical data: presenting information as pictures and diagrams, using interactive features and animation, both to convey results and as input for further analysis. A very important stage of big data analysis that allows the most significant findings to be presented in the most readable form.7
Big data in industry
According to the McKinsey Global Institute report “Big data: The next frontier for innovation, competition, and productivity”, data have become as important a factor of production as labor and production assets. By using big data, companies can gain tangible competitive advantages. Big Data technologies can be useful for solving the following tasks:
- forecasting the market situation
- marketing and sales optimization
- product development
- management decisions
- increase productivity
- efficient logistics
- monitoring the condition of fixed assets8,9
Industrial enterprises also generate big data through the adoption of Industrial Internet of Things technologies. In this process, the main components and parts of machine tools and machines are equipped with sensors, actuators, controllers and sometimes inexpensive processors capable of performing edge (fog) computing. During production, data is collected continuously and possibly pre-processed (for example, filtered). Analytical platforms process these volumes of data in real time, present the results in the most readable form and store them for further use. Based on the analysis of the data obtained, conclusions are drawn about the condition of the equipment, its performance, product quality, the need for process changes, and so on.
By monitoring information in real time, plant personnel can:
- reduce the number of outages
- improve equipment performance
- reduce equipment operating costs
- prevent accidents
The last point is particularly important. For example, operators in the petrochemical industry receive on average about 1,500 alarms per day, i.e. more than one message per minute. This increases the fatigue of operators, who must constantly make instant decisions about how to respond to each signal. An analytical platform, however, can filter out non-essential information, so operators can focus primarily on critical situations. This lets them identify and prevent incidents and possible accidents more effectively, which improves production reliability, safety, equipment availability and regulatory compliance.10
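The alarm-filtering idea can be illustrated with a toy sketch (the thresholds and field names are made up): suppress low-severity alarms and repeats so that operators see only the critical ones.

```python
# Toy alarm filter: keep only high-severity alarms, and only the first
# alarm per equipment tag. Severity scale and fields are illustrative.
from collections import namedtuple

Alarm = namedtuple("Alarm", "tag severity")  # severity: 1 (info) .. 3 (critical)

def critical_alarms(alarms, min_severity=3):
    seen_tags = set()
    result = []
    for alarm in alarms:
        # drop low-severity alarms and repeats of an already-reported tag
        if alarm.severity >= min_severity and alarm.tag not in seen_tags:
            seen_tags.add(alarm.tag)
            result.append(alarm)
    return result

stream = [Alarm("pump-7", 1), Alarm("reactor-2", 3),
          Alarm("reactor-2", 3), Alarm("valve-9", 2)]
print(critical_alarms(stream))  # only the first reactor-2 alarm survives
```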
In addition, the results of big data analysis can be used to calculate payback periods and the prospects of changing process regimes or reducing or reallocating staff, that is, to make strategic decisions about the further development of the enterprise.11