Big Data Clustering and Classification (Didžiųjų duomenų klasterizavimas ir klasifikavimas)
In today’s world, more and more data are collected and generated by digital devices every day. These data are characterized not only by the exceptional volume and velocity at which they have to be saved and processed, but also by their variety. Most of these data are unstructured or only semi-structured, so Big Data technologies and techniques have to be used to preserve their veracity and value. Various data mining tasks, such as clustering and classification, can be used to extract information from the collected material. However, most conventional clustering and classification algorithms are not well suited to Big Data analysis: to use them, the data first have to be preprocessed, either by reducing the feature set or by selecting only a sample of the available material. Clustering and classification algorithms can also be applied to Big Data by running them in parallel or across a distributed network of multiple devices. Various Big Data technologies, such as the MapReduce programming model, the Apache Hadoop framework and the Apache Spark engine, can be used for this purpose as well. They make it possible to analyse Big Data without much effort spent on distributing data or computation, so that development can focus on the functionality for finding useful information.
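As an illustration of how a clustering algorithm can be phrased in the MapReduce style mentioned above, the following is a minimal sketch of one iteration. It assumes k-means as the algorithm (the abstract does not name one) and uses plain Python in place of Hadoop or Spark; the function names `nearest`, `map_phase`, `reduce_phase` and `kmeans_step` are hypothetical.

```python
def nearest(point, centroids):
    # Index of the closest centroid by squared Euclidean distance.
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(partition, centroids):
    # Map: each data partition independently emits (cluster index, (point, 1)).
    return [(nearest(pt, centroids), (pt, 1)) for pt in partition]


def reduce_phase(pairs, centroids):
    # Reduce: sum points and counts per cluster, then average them.
    totals = {}
    for idx, (pt, n) in pairs:
        acc, count = totals.get(idx, ([0.0] * len(pt), 0))
        totals[idx] = ([a + p for a, p in zip(acc, pt)], count + n)
    # A cluster that received no points keeps its previous centroid.
    return [
        tuple(v / totals[i][1] for v in totals[i][0]) if i in totals else centroids[i]
        for i in range(len(centroids))
    ]


def kmeans_step(partitions, centroids):
    # In a real cluster the map calls would run on separate workers;
    # here they run sequentially to keep the sketch self-contained.
    pairs = [kv for part in partitions for kv in map_phase(part, centroids)]
    return reduce_phase(pairs, centroids)
```

Calling `kmeans_step` repeatedly until the centroids stop changing gives the usual k-means loop. Only the per-cluster sums and counts need to cross the network, which is why this formulation suits data that are too large for a single machine.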