Application of Python in Marketing Education: A Big Data Analytics Perspective
Postsecondary Education
2153-9987
In the era of big data, many business organizations consider data analytics skills an important criterion when recruiting qualified applicants. As numerous managerial decisions in the field of marketing are becoming evidence-based, business schools have integrated case studies about the different stages of data analytics, such as problem identification, data collection, data processing, data analysis and data visualization, in order to improve the knowledge of marketing students. Although case studies can provide a good theoretical foundation in data analytics for marketing, they may not be sufficient for building analytical skills from a technical perspective. This paper provides a guideline on how Python as a programming language can be used to explore large datasets and improve marketing students' capabilities, with a focus on data processing, data analysis and data visualization tasks. In this research, a survey was conducted to measure the teaching effectiveness and overall satisfaction of marketing students (n = 84) in a Canadian university. The evidence suggests that Python libraries designed for marketing-related data analysis and data visualization have positive outcomes for students' learning experience and perception of teaching effectiveness.
APPLICATION OF PYTHON IN MARKETING EDUCATION: A BIG DATA ANALYTICS PERSPECTIVE
<sbt id="AN0169784009-2">Introduction</sbt>
Over the last few years, business schools and faculties of management have designed dedicated programs to develop students' skills in Big Data Analytics and Data Science as an important part of their managerial skill set. For instance, HEC Montreal[1] offers a Master of Science program in Data Science and Business Analytics. Other examples include the Master of Management in Analytics at the Desautels Faculty of Management[2] of McGill University in Canada and the Master in Data Sciences and Business Analytics at ESSEC[3] in France. Some MBA programs offer business analytics as a concentration or specialization, such as the McCombs School of Business at the University of Texas at Austin. Among large academic institutions, many more examples could be listed. However, smaller business schools are still at an early stage of integrating this field into their program offerings.
According to the Graduate Management Admission Council (GMAC, [4]), the Covid-19 pandemic affected management program admissions and student recruitment across the globe; however, data analytics programs continue to grow in business schools worldwide. US business schools have expanded aggressively into the data analytics space over the span of just a few years. According to the data, 525 MBA programs and 523 business master's programs participated in the 2020 Application Trends Survey. In this survey, 81 business schools from different countries stated that they offer a master's program in Data Analytics. Seventy-one participants reported a growth rate of 73% in student application volumes, which is higher than the growth rate in 2018 and 2019. It is worth noting that some academic institutions may not have sufficient technological and human resources to offer an entire program in this field. However, the popularity and importance of this field for management undergraduates have pushed departments to incorporate some of its components into their existing courses. The integration of big data analytics material in marketing courses is no exception.
What is Big Data?
Big data refers to high-volume data collected from various sources, such as electronic devices, sensors, satellites, social media feeds, and GPS signals (Davenport, Barth, & Bean, [2]). Generally, the amount of data generated by electronic devices exceeds the capability of traditional software to collect, transform, store, analyze and visualize it. Hence, the exploration of this phenomenon requires more advanced tools and techniques.
According to Oracle ([10]), big data has three main characteristics: volume, velocity, and variety. Volume describes the fact that data size continues to increase, so firms need the necessary capacity to store the data. Velocity refers to the speed at which the data should be transformed and prepared for analysis. In other words, traditional software may not be sufficient for real-time data processing and analysis. Variety refers to the different structures of data, such as text, images, and videos. A fourth characteristic, known as veracity, has been added to describe the quality and truthfulness of data.
Specifically, the marketing departments of many large business organizations rely heavily on big data to extract insights and patterns about consumer behavior and activities. For instance, segmentation decisions have become highly evidence-based, and the tools and techniques continue to evolve. It is important to note that statistical literacy and technological expertise are requirements in analytical projects. However, individuals working in this field should also be capable of translating the results of the analysis into business terms. In other words, it becomes essential to understand the underlying business issues in order to extract information for value creation or problem solving. This business relevance is just as important as the mastery of mathematics, statistics, and algorithms. Hence, it is essential that marketing students discover more powerful data analysis software that can handle high-volume, high-velocity, and high-variety data. Furthermore, students registered in a business-related field are not required to have the specialized knowledge needed to develop algorithms, but they can use a variety of statistical functions or prebuilt algorithms and apply them to marketing, operations, or finance.
Today, evidence-based decision-making skills using automated tools adapt students' knowledge to job market needs (Gillentine & Schulz, [3]). Innovative experiential learning methods facilitate the acquisition of knowledge about real-life applications of marketing concepts. Consequently, marketing educators tend to incorporate experiential learning in the classroom so that students can work with real business datasets (Hawes & Foley, [5]; Li, Greenberg, & Nicholls, [7]). In the extant literature, scholars have discussed the challenges of incorporating AI and data analytics tools in business schools. For example, Thontirawong and Chinchanachokchai ([13]) highlighted limitations related to teaching guidelines, proper resources, the cost of software (Liu & Burns, [8]) and marketing students' lack of technical knowledge, all of which are challenging in academic environments.
Given the obstacles to the integration of AI and data analytics tools, it is important to understand how open-source software can contribute to the learning objectives of Big Data Analytics in marketing courses. Thus, the aim of this study is to provide a guideline on the integration of Big Data Analytics in marketing courses using experiential learning methods. This includes instructions on how to use a powerful open-source software distribution containing Python and other relevant packages for big data processing, big data analysis, and visualization.
Big Data Analytics Process
Real-time unstructured data cannot reveal previously unknown events unless data engineers and analysts extract hidden patterns and relationships using analytical methods. Although there is no standard definition, the term "analytics" refers to the management of the data lifecycle using scientific techniques and automated tools. This process encompasses identifying the business problem and collecting, processing, analyzing, and visualizing data. All these tasks are carried out to prepare data for current and future analysis, and the ultimate goal is to support decision making. Generally, there are different types of analytics depending on the time of the event and the business question that is being asked. For instance, descriptive analytics answers questions about events that have already occurred, whereas predictive analytics seeks to determine the outcome of an event that might occur in the future. Figure 1 illustrates the process for all types of data analytics.
Graph: Figure 1. Big data analytics project lifecycle.
Business Problem Identification
Every big data analytics exercise requires a well-defined business problem that sets a clear goal for the analysis. A clear problem statement also helps decision-makers identify the resources that will need to be used. At the end of this step, a precise question should be formulated before the data collection task begins. In the context of marketing activities, a business problem could be related to market segmentation, the identification of new product opportunities or the understanding of buyer behavior. For instance, a researcher or a marketing analyst could formulate a specific business problem such as "what are homogeneous groups of buyers in our customer dataset?" Furthermore, the required variables, the goal and the available resources for the analysis should be identified at this stage. Finally, the similarities and buying behavior of each group can be examined in later stages with a focus on measures such as favorite stores, annual income, brand loyalty, spending score, and frequency of purchase.
Data Collection
Data is a powerful fuel for evidence-based marketing decisions. Many business organizations have realized that, in most cases, the more data is collected, the greater the chances of finding hidden patterns. The collection of customer transactions (e.g., purchase, claim, credit card payment) allows companies to target their customers more precisely. Internal sources of data are considered the primary reliable sources of data. The data is often stored by online transaction processing (OLTP) systems. Additionally, organizations that possess ERP (Enterprise Resource Planning) and CRM (Customer Relationship Management) systems continuously monitor and collect data about internal business processes. In contrast to transactional datasets, which are highly structured, data can also be collected in a semi-structured or unstructured form, such as text on social media platforms, e-mails, claim forms, and web browser cookies.
In some cases, data collection entails the use of sampling methods in order to build an analytical model. These models can further be used to target similar customers. To build an appropriate analytical model, the sample should consist of both historical and recent data that is representative of the future customers.
Data Processing
Given the inconsistencies, incompleteness, and duplication found in raw data, organizations are not able to extract hidden patterns from it easily. These challenges can be overcome if data engineers and analysts transform the raw data into valuable information assets. It should be noted that messy data leads to messy analysis results and decisions. Therefore, the quality of data plays a significant role in enhancing the accuracy of analysis results.
Basically, data processing includes the application of methods and operations in order to clean, filter, transform, and classify raw data into valuable and organized information assets for further analysis. The main purpose of data processing is to prepare data for analysis by cleaning, organizing, and reducing the data to a relevant size throughout the analytics process. Given the increasing volume of unstructured data, the use of advanced automated tools and machine learning algorithms can facilitate this process.
Data Analysis
Data analysis refers to the application of logical and statistical techniques to examine a dataset and to draw inferences from data. The aim of data analysis is to identify patterns or relationships that exist within a dataset. The conclusions drawn from the data help analysts and decision makers to acquire knowledge about various events. The choice between qualitative and quantitative approaches to data analysis depends on the research methodology, the data format, and the context in which the data is being prepared. Classification, regression, clustering, anomaly detection and association analysis are among the techniques that can be integrated in marketing courses.
Data Visualization
The visual representation of data is about ways to depict data through a choice of physical forms. By combining statistics and design, data visualization aims to communicate data or information effectively to readers. Typically, data is visualized in the form of a chart, infographic, diagram, or map. IBM ([6]) defines data visualization as the process of translating large data sets and metrics into charts, graphs, and other visuals. Exploiting visual perception abilities relates to the scientific understanding of how our eyes and brains process information most effectively.
Utilization of Analysis Results
In the context of quantitative methods, statistical literacy skills play an important role in data interpretation. Statistical literacy focuses on the use of statistics as evidence in arguments (Schield, [11], [12]). The absence of interpretation skills might mislead the decision-making process. Hence, this step requires strong data literacy skills to ensure more accurate evidence-based decisions.
Python
Python is a free, dynamic, and extensible language that allows a modular and object-oriented approach to programming. Python was conceived in 1989, first released in 1991, and has become popular for educational and scientific computing. It was created by Guido van Rossum at the National Research Institute for Mathematics and Computer Science in Amsterdam and has since been developed together with many volunteer contributors. The syntax of Python is simple and is combined with advanced data types, which makes it possible to write programs that are very compact and easily readable. Considering programs with equal functionality, a Python program is often 3 to 5 times shorter than an equivalent C, C++ or even Java program, which generally represents a 5 to 10 times shorter development time and greatly increased ease of maintenance. This programming language has captured the attention of marketers due to its simplicity, accessibility, versatility, and automation performance. Some advantages of Python for marketers include automating customer segmentation, customer feedback analysis and A/B testing. This open-source programming language also supports marketers in increasing the accuracy of ROI calculations through comprehensive attribution models that facilitate the identification of the channels that bring in the highest number of customers. For advertising purposes, sophisticated machine learning models decide which ad should be displayed to which customer and at what time. For marketing research analyses such as segmentation or social media text mining, Python is more versatile for pulling data from the web than the R programming language. By comparison, R packages such as rvest are designed for basic web scraping. The different versions of Python for different operating systems, together with the documentation of the existing libraries, are available for download at: http://www.python.org.
Python Libraries/Packages
A Python library is a collection of functions and methods that allows users to perform many actions without typing lengthy code. In the context of this research, these actions include complex mathematical and statistical computations. Python libraries have been created by a community of developers, and together they form an ecosystem that can help with a range of development needs. For instance, most Python statistical functions implement long formulas that can be called using a single word. It is important to note that every library includes various functions and each has a different purpose. Table 1 illustrates the functionality of some powerful libraries for big data analytics that can be used in marketing courses.
Table 1. Relevant Python libraries for Big Data Analytics.
Jupyter Notebook
Many free text editors are available for writing Python code, and the software interface has an important impact on learning outcomes. Therefore, we suggest Jupyter Notebook. It is a free, open-source, interactive web tool known as a computational notebook, which researchers can use to combine software code, computational output, explanatory text, and multimedia resources in a single document. The main advantage of Jupyter Notebook is that, unlike in plain text editors, students can execute their code incrementally, one cell at a time, which saves them a lot of time in identifying errors and learning the syntax.
Installation of Anaconda Python Distribution
Using the traditional method of installation, Python and its libraries can be downloaded from independent websites and each component should be installed separately. The installation and the execution of software including libraries/packages are sometimes a hassle for marketing educators and marketing students. To facilitate the installation process for the students and the educators, we propose
The advantages of using Anaconda include a hassle-free installation process and support for the latest version of Python. Another advantage is its compatibility with the Windows, Mac, and Linux operating systems. Moreover, Anaconda includes other software packages and libraries commonly used in Python programming, which allows students to learn and use Python in different fields of study.
Assignment Design
Given that Python is a multi-purpose programming language, the objective of this class is to demonstrate the integration of Big Data Analytics using Python for marketing problem solving. The teaching method in this class allows students to work with sample datasets and acquire technical skills in this field. The students will benefit from several advantages that this tool offers. First, it is used to build anything from simple scripts to complex apps with massive numbers of users. Secondly, the syntax of Python is easier to read for students who are learning their first programming language. Thirdly, it is free and widely available with a massive open-source community who provide additional resources and suggestions for problem-solving. Finally, Python is popular in big data analytics, artificial intelligence and web development which might be areas of interest for marketing students who plan to further develop their skills in these fields.
In this particular assignment, students learn how to work with Anaconda Python in order to collect, clean, filter, analyze and visualize data (see Appendix). The class is divided into 4 sessions (approximately 3 hours each). Given that the students can download Anaconda Python at no cost, many small and medium academic institutions can consider this powerful resource as an attractive solution for problem solving in marketing courses. Additionally, the application is compatible with Windows OS, Mac OS and Linux. It is important to note that this class does not have any specific prerequisite related to computer science or artificial intelligence. However, basic knowledge of statistics and mathematics would be beneficial to facilitate the learning process in data analysis and data visualization parts of the analytics process. According to the extant literature, marketing students try to avoid classes that are quantitative in nature (Bridges, [1]). Hence, the instructor plays an important role in highlighting the simplicity of Python syntax for data analytics.
Introduction to Big Data and Data Analytics (Concepts)
In this part, the students familiarize themselves with big data concepts. The instructor can start the discussion by asking the following questions: (1) what is big data? (2) what are the characteristics of big data? Although there is no standard definition in the literature, the discussion can begin with a general definition of the terms "big data" and "analytics." Then, the characteristics of big data such as volume, velocity, variety, veracity, and value can be introduced. Furthermore, the importance of evidence-based decisions should be highlighted, as it is the reason more organizations have decided to invest in big data technologies. The instructor can then explain the objective and advantages of Big Data Analytics in the context of marketing activities.
Data Analytics Typology (Concepts)
Following the previous discussion, the four types of data analytics can be presented with examples: (1) descriptive analytics, (2) diagnostic analytics, (3) predictive analytics, and (4) prescriptive analytics. This will help the students to differentiate between the types and understand the aim of each. The main focus of this assignment is solving a market segmentation problem using descriptive and diagnostic analytics.
Assignment Instructions (Practical)
First, the instructor can provide instructions on how to install Anaconda Python prior to the class. Following the installation, the instructor introduces Jupyter Notebook, which is the main editor for Python code in this class. At this stage, the instructor demonstrates the editor's interface and its basic functionalities for modifying and executing code. This can be followed by a demonstration of some basic Python mathematical and statistical functions. The objectives of the assignment should be clarified and groups of 3–4 students can be formed. Furthermore, the instructor may explain the step-by-step tasks of the analytics process (see Appendix). In addition, a list of the relevant functions in each Python module can be provided to the students.
This part guides students on how to search for accessible external data known as open data. The instructor explains the importance of data that can be collected from the internal sources of a business organization, such as ERP, CRM, and other enterprise-wide applications. The importance of the reliability of data sources should be highlighted. Given that the students do not have access to real datasets, a fictional dataset is suggested to familiarize students with sample variables as well as the structure of transactional data. In this assignment, the instructor can formulate a specific marketing segmentation problem such as "what are homogeneous groups of buyers in our customer dataset?" This marketing problem requires the students to perform cluster analysis in later steps. Following the identification of this problem, the students can focus on the dataset provided in this research.
A subsidiary of Google called
Graph: Figure 2. Initial exploration of the marketing dataset using Python.
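As a minimal sketch of the kind of first look summarized in Figure 2, the snippet below loads a small fictional customer table and reports its size, its numerical and categorical variables, and summary statistics. The use of pandas, the column names, and all values are assumptions made for illustration; they are not the dataset or the code used in the class.

```python
import io

import pandas as pd

# Fictional customer records (illustrative only, not the study's dataset).
CSV_TEXT = """customer_id,gender,age,annual_income_k,spending_score
1,Female,23,18,79
2,Male,35,62,41
3,Female,47,85,17
4,Male,29,,88
5,Female,52,40,55
"""

# read_csv accepts a file path or a file-like object; StringIO keeps the sketch self-contained.
customers = pd.read_csv(io.StringIO(CSV_TEXT))

n_rows, n_cols = customers.shape
numeric_cols = customers.select_dtypes(include="number").columns.tolist()
categorical_cols = customers.select_dtypes(exclude="number").columns.tolist()

print(customers.head())            # first rows of the table
print(f"{n_rows} rows, {n_cols} columns")
print("Numerical variables:", numeric_cols)
print("Categorical variables:", categorical_cols)
print(customers.describe())        # summary statistics of the numerical variables
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by the path to the CSV file.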
At this stage, the instructor demonstrates how to import and read a dataset using
The instructor may show several functions of
To facilitate the learning process, the students should ask themselves a few questions: (1) do we need all the variables for cluster analysis? (2) should we transform certain variables? (3) are there any outliers, missing values or strange values? (4) should we create a new variable? Additionally, the instructor can demonstrate how to combine the values of different cells (e.g. concatenation) or how to separate the values in one single cell. For instance, a cell may contain the day, month, and year. Python can extract the "year" value if this value is the main focus of further analysis. In the opposite scenario, the values of several cells can be combined. Figure 3 illustrates an example of data cleaning and transformation.
Graph: Figure 3. Identification of missing values and data transformation using Python.
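The cleaning questions above can be sketched in a few lines. The example below assumes pandas and a fictional transaction table; the column names, the values, and the choice to fill missing amounts with the median are illustrative assumptions rather than the steps shown in Figure 3.

```python
import pandas as pd

# Fictional transactions (illustrative only).
raw = pd.DataFrame(
    {
        "first_name": ["Ana", "Ben", "Ben", "Chloe"],
        "last_name": ["Roy", "Lee", "Lee", "Kim"],
        "purchase_date": ["2021-03-15", "2022-07-02", "2022-07-02", "2023-01-20"],
        "amount": [120.0, None, None, 75.5],
    }
)


def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy: no duplicate rows, no missing amounts, two derived columns."""
    out = df.drop_duplicates().copy()
    # Replace missing amounts with the median of the observed amounts.
    out["amount"] = out["amount"].fillna(out["amount"].median())
    # Separate one cell into parts: keep only the year of the purchase date.
    out["purchase_year"] = pd.to_datetime(out["purchase_date"]).dt.year
    # Combine several cells into one (concatenation).
    out["full_name"] = out["first_name"] + " " + out["last_name"]
    return out.reset_index(drop=True)


print(raw.isna().sum())  # missing values per column, before cleaning
cleaned = clean_transactions(raw)
print(cleaned)
```

Whether missing values should be filled, dropped, or investigated depends on the business problem; the median is used here only to keep the sketch short.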
Depending on the purpose of the analysis, the instructor can demonstrate functions that are related to cluster analysis, regression, or correlation analysis. However, the instructor may choose other analysis techniques that are relevant to the main analytics problem. Additionally, prior knowledge of introductory statistics can improve the students' learning in the data analysis stage. The Appendix features a detailed example of cluster analysis code for Python. An example of cluster analysis of the marketing dataset in Python is presented in Figure 4.
Graph: Figure 4. K-means clustering algorithm using Python.
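For readers without access to the Appendix, a minimal K-means sketch follows. It assumes scikit-learn and pandas, uses fictional income and spending values constructed to form three loose groups, and fixes the number of clusters at three for that reason; none of this is taken from the authors' code.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Fictional customers: three loose groups by income and spending (illustrative only).
customers = pd.DataFrame(
    {
        "annual_income_k": [15, 18, 20, 22, 55, 58, 60, 62, 95, 98, 100, 105],
        "spending_score": [80, 85, 78, 90, 45, 50, 48, 52, 15, 20, 12, 18],
    }
)


def segment_customers(df, n_clusters=3, seed=0):
    """Standardize the features, fit K-means, and return the data with a 'segment' column."""
    # Standardizing puts both variables on a comparable scale before distances are computed.
    features = StandardScaler().fit_transform(df)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = model.fit_predict(features)
    return df.assign(segment=labels), model


segmented, model = segment_customers(customers)
# Profile each segment by its average income and spending score.
print(segmented.groupby("segment").mean())
```

On real data the number of clusters is usually not known in advance; one common approach is to fit the model for several values and compare `model.inertia_` before settling on a value.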
Python has basic and advanced libraries for data visualization. In this assignment,
Graph: Figure 5. Additional data visualization in Python.
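One possible visualization of segmentation results is a scatter plot colored by segment. The sketch below assumes Matplotlib and a fictional, already-segmented table; the library choice, column names, and values are illustrative assumptions and do not reproduce Figure 5.

```python
import matplotlib

matplotlib.use("Agg")  # file-only backend so the sketch runs without a display; omit in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Fictional, already-segmented customers (illustrative only).
segmented = pd.DataFrame(
    {
        "annual_income_k": [15, 20, 22, 58, 60, 62, 95, 100, 105],
        "spending_score": [80, 78, 90, 50, 48, 52, 15, 12, 18],
        "segment": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    }
)


def plot_segments(df):
    """Scatter plot of income against spending score, colored by segment."""
    fig, ax = plt.subplots(figsize=(6, 4))
    points = ax.scatter(
        df["annual_income_k"], df["spending_score"], c=df["segment"], cmap="viridis"
    )
    ax.set_xlabel("Annual income (thousand $)")
    ax.set_ylabel("Spending score")
    ax.set_title("Customer segments")
    fig.colorbar(points, ax=ax, label="Segment")
    return fig, ax


fig, ax = plot_segments(segmented)
fig.savefig("customer_segments.png", dpi=150)
```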
Following the demonstration of each stage, the instructor can ask each group to use the indicated functions and discuss the findings or recommendations related to the marketing problem identified in the beginning of analytics process. As the business problem in this assignment is related to segmentation using cluster analysis, each group can give a 5-min presentation to discuss their findings. The content of the presentation should cover the following topics: how many numerical and categorical variables were observed in the marketing dataset? What variables were considered as important for cleaning and analysis? What are the findings of this cluster analysis and how many groups were identified? What conclusions can we draw using the existing evidence of customer segmentation?
Hypothesis Development
According to Marsh ([9]), the quality of learning can be improved by adapting the teaching process in response to students' feedback, and this has an impact on student perceptions of the teaching and learning environment. Hence, it is essential to measure teaching effectiveness and student involvement in the learning process. In this research, the following hypotheses have been developed in order to measure students' perception of teaching effectiveness and their learning experience.
Methodology
Procedure and Sample
The objective of this research is to integrate Big Data Analytics using Python in the context of marketing education. The topic was taught during 4 sessions (12 hours in total), and the participants of the survey (n = 84) were undergraduate marketing students in a Canadian university who discovered the application of big data analytics in business-oriented decisions using experiential learning methods. In this research, a self-administered electronic questionnaire was sent to 102 marketing students. The final sample was 84, for a response rate of 82%. It is worth noting that this big data analytics course is offered to students of all business majors; however, the survey was sent only to undergraduate students who have a major or concentration in marketing.
Measures
In this research, the teaching effectiveness measures were adapted from Li et al. ([7]). This scale consists of four areas and 10 items in total. Each item is measured on a seven-point scale from 1 (strongly disagree) to 7 (strongly agree).
Teaching Effectiveness Results
The results of teaching effectiveness are illustrated in Table 2. According to the results, the respondents expressed their learning experience and perception of teaching effectiveness with regards to career preparation (
Table 2. Measures of teaching evaluation.
1 Career preparation (α = .87), traditional educational goals (α = .95), use of time (α = .96), and personal involvement and satisfaction (α = .86).
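The reliability coefficients in the note above are presumably Cronbach's alpha values; the source does not name the coefficient, so this reading is an assumption. Under that assumption, the sketch below shows how such a coefficient can be computed with NumPy from a respondents-by-items table. The responses are hypothetical and are not the survey data.

```python
import numpy as np


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a respondents-by-items array of scale scores."""
    scores = np.asarray(item_scores, dtype=float)
    n_items = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)      # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of the summed scale
    return n_items / (n_items - 1) * (1 - item_variances.sum() / total_variance)


# Hypothetical answers of five respondents to three 7-point items (not the survey data).
responses = [
    [6, 6, 7],
    [5, 5, 5],
    [7, 6, 7],
    [4, 5, 4],
    [6, 7, 6],
]
print(round(cronbach_alpha(responses), 2))
```

Values closer to 1 indicate that the items of an area move together across respondents.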
The evidence shows that the sample means for all measures were higher than 5, with standard deviations lower than 1. Additionally, the one-sample Wilcoxon signed-rank test was used as a non-parametric alternative to the one-sample t-test. This test is appropriate when the data cannot be assumed to be normally distributed. According to the results, all four hypotheses of this research are supported.
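A one-sample Wilcoxon signed-rank test of this kind can be sketched with SciPy as follows. The scores are hypothetical, and the comparison value of 4 (the neutral midpoint of the seven-point scale) is an assumption; the source does not state which reference value was tested.

```python
from scipy.stats import wilcoxon

# Hypothetical mean scale scores of twelve respondents (1-7 scale); not the survey data.
scores = [5.5, 6.0, 5.0, 6.5, 4.5, 5.75, 6.25, 5.25, 4.75, 6.75, 3.6, 5.6]
REFERENCE = 4.0  # assumed comparison value: the neutral midpoint of the scale

# The one-sample test is run on the differences from the reference value.
differences = [score - REFERENCE for score in scores]

# alternative="greater" tests whether scores tend to lie above the reference value.
result = wilcoxon(differences, alternative="greater")
print(f"W = {result.statistic}, p = {result.pvalue:.4f}")
```

A small p-value would indicate that the scores tend to exceed the reference value, which is the pattern a supported hypothesis would show.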
Discussion
From a big data analytics perspective, this research provides a guideline related to the application of Python programming language in the context of marketing education. To the best of the authors' knowledge, this research is among the first to propose an innovative method of teaching big data analytics in marketing education.
Theoretical Contributions
The original research results reported in this paper contribute to the field of marketing education. Given the importance of big data analytics techniques and concepts in business curricula, this paper brings new evidence on the impact of new big data analytics tools and techniques on teaching effectiveness in marketing education. In the extant literature, the benefits of case studies and conceptual discussions about big data analytics have been highlighted; however, this research shows that transmitting knowledge about big data analytics concepts through a structured practical design and the use of more powerful tools can enhance teaching effectiveness in delivering such material.
It is worth noting that the extant literature contains various definitions and keywords to describe the terms "big data" and "analytics." Given the broad definitions and the lack of detailed marketing-related examples in previous research, this paper sheds light on each stage of the big data analytics process in more detail in order to support marketing-related decisions.
Faced with technological change and the need for up-to-date curricula, small business schools and academic institutions actively search for educational resources to adapt students' skills to current job market needs. From a theoretical perspective, the results of this research call attention to the increasing importance of new open-source software that can be used for educational purposes, and more specifically in the field of marketing.
It is also worth noting that most existing research contributions related to the integration of big data analytics in marketing courses have explored the teaching effectiveness results using small class data and limited sample results. The present paper differs from previous contributions by expanding the scope of analysis and focusing on a large student sample. It is important to note that the sample is taken from marketing students across multiple sections of the same course.
Managerial Implications
In this paper, we contribute to the understanding of analytic-driven decision making in the context of marketing activities. Despite the popularity of the term "big data," many small and medium size academic institutions still face challenges for the integration of such material in their curriculum. In fact, most business schools appear to still be facing a learning stage in terms of the potential benefits of integrating such teaching methods in marketing programs.
This research suggests several lessons for marketing educators and marketing research teams in various organizational environments. First, marketing educators in academic institutions should be aware that specific practice-based examples can enhance teaching effectiveness. Such examples can therefore be helpful in transmitting knowledge about technical topics such as big data analytics.
Second, these results have direct implications for other organizational environments that provide training about analytical processes for marketing research teams. For instance, SMEs that do not have sufficient budgets for big data analytics technologies can benefit from powerful open-source software like Anaconda Python for marketing research activities. Another main contribution of this paper is the step-by-step code and instructions for developing a K-means clustering algorithm. This enables users to carry out cluster analysis, which is beneficial for marketing decisions. Marketing educators in SMEs can use the instructions highlighted in this paper as a structured resource for training purposes. This can be beneficial for marketing departments where research teams have to focus on large marketing-related datasets. Although this article has focused exclusively on cluster analysis examples, it would be interesting to attempt to verify the outcomes using other data analysis and visualization examples. Finally, it is important to note that the other open data sources mentioned in this article may support future practice and exercises toward the understanding of other types of analysis for marketing-based decisions.
Research Limits and Perspectives
Today, rapid changes in market needs push academic institutions to update their program curriculum and prepare students who can respond to organizations' expectations. Given that the cost of software and automated tools in the classroom is considered one of the biggest challenges in academic institutions, this research proposes a powerful and free software package that can be integrated in marketing courses to familiarize students with big data analytics concepts. Canadian universities and business schools have a multicultural and diverse student pool. One of the limitations of this research is that the multicultural background of participants was not taken into consideration. Another limitation of teaching big data analytics is that undergraduate students of social science have different educational and technical backgrounds. Educators can enhance the learning process by providing a video file (using screen recording software) or a text file which contains the step-by-step instructions, so that students can follow each step at their own pace during class time.
A promising route of future research would be to study in detail the impacts of such tools on learning outcomes and also challenges faced by students. Although we present evidence related to marketing students, it would be interesting to evaluate the impacts of users' distinct educational backgrounds on the perceived learning outcomes or teaching effectiveness.
The results of this research show that the knowledge of marketing students in the field of big data analytics has improved and that the use of Python in marketing education had a positive outcome in perceived teaching effectiveness related to such concepts and processes. Additionally, marketing students have acquired technical skills that are an added value for their current and future job applications. In addition to marketing education goals, the benefits of such technological resources may enhance teaching performance in other programs and fields of study.
Acknowledgments
The authors wish to thank Dr. Barbara Wooldridge and three anonymous reviewers for their invaluable feedback in preparing this article.
Disclosure Statement
No potential conflict of interest was reported by the author(s).
Appendix
Customer segmentation is the subdivision of a market into discrete customer groups that share similar characteristics. The purpose of this assignment is to explore a marketing dataset and identify customers' similarities using a K-means clustering algorithm. It is necessary to execute each block of Python code by clicking on the "Run" button in Jupyter Notebook.
As part of the market segmentation goal, we would like to put customers into groups and identify the clusters based on similarities in their age, annual income and spending score (1 to 100).
<bold>Step 1</bold>. We will begin the analytics process by loading the necessary Python packages and importing the market sample dataset. Then, we will explore the shape and the type of variables included in the dataset.
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
The following code allows you to import and read the sample dataset, which is in CSV format. In this assignment, we have decided to call our dataset "
market_data = pd.read_csv('Marketing_Dataset.csv')
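For readers who do not yet have the sample file at hand, the behaviour of "read_csv" can be tried on a few lines of text. The following is only a minimal sketch: the rows and the name "demo" are invented for illustration and are not part of the assignment dataset.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file (values are invented).
csv_text = """CustomerID,Gender,Age
1,Male,19
2,Female,35
"""

# read_csv accepts a file path or any file-like object; the first row becomes the header.
demo = pd.read_csv(io.StringIO(csv_text))

print(demo.columns.tolist())
print(len(demo))
```

The same call with a file path, as in the command above, produces a DataFrame in exactly the same way.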
<bold>Step 2</bold>. The initial exploration of the data begins after importing the dataset. The following attribute allows you to identify the number of rows and columns in a dataset.
market_data.shape
The "dtypes" attribute identifies the data type of each column in a dataset. In other words, it can distinguish the numerical and categorical variables. For instance, the following command will report "Gender" as "object," because "Gender" is a categorical variable stored as text. Other variables such as "Annual income" will be reported with an integer type, because they contain numerical values.
market_data.dtypes
The "head" function displays the first rows of a dataset. This is useful when you need to observe the nature of the variables before any processing and analysis. For instance, we can observe all the variables for the first five rows by executing the following command.
market_data.head(5)
The "describe" function displays descriptive statistics related to all numerical values in a dataset. This will give you an idea about the mean, standard deviation, minimum value, maximum value, 25th percentile, median and 75th percentile.
market_data.describe()
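The exploration commands of this step can also be tried on a very small table before running them on the full dataset. The sketch below assumes a hypothetical five-row DataFrame called "sample" whose values are invented; only the column names follow those used in this assignment.

```python
import pandas as pd

# Hypothetical stand-in for the marketing dataset (values are invented for illustration).
sample = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5],
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
    "Age": [19, 35, 26, 27, 31],
    "Spending Score (1-100)": [39, 81, 6, 77, 40],
})

n_rows, n_cols = sample.shape        # (number of rows, number of columns)
column_types = sample.dtypes         # one data type per column
numeric_summary = sample.describe()  # by default, summarizes numerical columns only

print(n_rows, n_cols)
print(column_types)
print(numeric_summary.loc[["mean", "min", "max"]])
```

Note that "Gender" does not appear in the output of "describe" because it is not a numerical column.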
<bold>Step 3</bold>. Data processing using Python allows you to improve the quality of your data prior to analysis. This includes the identification of missing values, which can be replaced by other values. Additionally, you can delete or transform the variables which are not important for analysis. For instance, it is not necessary to keep the "CustomerID" for this analysis. Therefore, we will use the "drop" function to remove the entire column.
market_data.drop(["CustomerID"],axis=1, inplace=True)
The following command counts the missing values in each column. As data analysts, we can decide whether to find an alternative for missing values or leave them as they are. This is important if those variables have to be considered for further analysis.
market_data.isnull().sum()
After executing this code, we can observe that some columns such as "profession," "work experience" and "family size" contain missing values.
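If we decide to replace missing values rather than leave them, one common option is to fill a categorical column with its most frequent value and a numerical column with its median. The following is a minimal sketch of that option on an invented four-row table; the choice of mode and median is an assumption made for illustration, not a recommendation specific to this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with gaps (values are invented for illustration).
sample = pd.DataFrame({
    "Profession": ["Engineer", None, "Artist", "Engineer"],
    "Family Size": [3, 2, np.nan, 4],
})

missing_before = sample.isnull().sum()

cleaned = sample.copy()
# Categorical column: fill with the most frequent category.
cleaned["Profession"] = cleaned["Profession"].fillna(cleaned["Profession"].mode()[0])
# Numerical column: fill with the median of the observed values.
cleaned["Family Size"] = cleaned["Family Size"].fillna(cleaned["Family Size"].median())

missing_after = cleaned.isnull().sum()

print(missing_before)
print(missing_after)
```

Working on a copy keeps the original table available, so the effect of the replacement can be compared before and after.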
<bold>Step 4</bold>. Data Analysis and Visualization
We can start our initial analysis and visualization by creating simple charts or graphs. This will give us an idea about the distribution of the data. The following code allows us to display the distribution of customer "Gender" in the market dataset. We observe that females are in the lead as compared to males. The code in the first row allows us to customize the size of this chart.
plt.figure(figsize=(15, 5))
sns.countplot(y='Gender', data=market_data)
plt.show()
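The counts shown in such a chart can also be obtained as numbers. As a minimal sketch, assuming an invented "Gender" column of five customers, the "value_counts" function returns the absolute and relative frequency of each category.

```python
import pandas as pd

# Hypothetical gender column (values are invented for illustration).
genders = pd.Series(["Female", "Male", "Female", "Female", "Male"], name="Gender")

counts = genders.value_counts()                # absolute frequency per category
shares = genders.value_counts(normalize=True)  # relative frequency per category

print(counts)
print(shares.round(2))
```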
The following code displays the distribution of customer "Age" in the market dataset.
plt.figure(figsize=(15, 5))
# Pass the column by keyword; recent seaborn versions do not accept it positionally.
sns.countplot(x='Age', data=market_data, palette='hsv')
plt.title('Distribution of Age', fontsize=20)
plt.show()
The following code displays the distribution of customer "spending score" which is from 1 to 100.
plt.figure(figsize=(15, 5))
# Pass the column by keyword; recent seaborn versions do not accept it positionally.
sns.countplot(x='Spending Score (1-100)', data=market_data, palette='copper')
plt.title('Distribution of Spending Score', fontsize=20)
plt.show()
The following code performs a bivariate analysis between "Gender" and "Spending Score." It is clearly visible that most of the males have a spending score of around 25 to 70, whereas the females have a spending score of around 35 to 75 (on the 1 to 100 scale), which again points to the fact that women are shopping leaders.
plt.figure(figsize=(15, 5))
# Name the x and y columns explicitly; recent seaborn versions require keyword arguments.
sns.boxenplot(x='Gender', y='Spending Score (1-100)', data=market_data, palette='Blues')
plt.title('Gender vs Spending Score', fontsize=20)
plt.show()
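The ranges read from such a plot can be checked numerically by grouping the spending score by gender. The sketch below uses an invented five-row table, so its numbers say nothing about the actual dataset; it only illustrates how the quartiles per group can be computed.

```python
import pandas as pd

# Hypothetical customers (values are invented for illustration).
sample = pd.DataFrame({
    "Gender": ["Female", "Female", "Female", "Male", "Male"],
    "Spending Score (1-100)": [40, 60, 80, 20, 50],
})

# Descriptive statistics of the spending score, computed separately for each gender.
by_gender = sample.groupby("Gender")["Spending Score (1-100)"].describe()

print(by_gender[["count", "25%", "50%", "75%"]])
```

The "25%" and "75%" columns correspond to the range in which the middle half of each group falls.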
<h31 id="AN0169784009-31">Cluster Analysis</h31>In cluster analysis, the elbow method is a heuristic used in determining the number of clusters in a data set. The method consists of plotting the explained variation as a function of the number of clusters, and picking the elbow of the curve as the number of clusters to use. First, we have to select the features based on which we want to create clusters. In the following example, we create a variable "x" to show that we want to include "Age" and "Spending score" as features for the creation of clusters.
x = market_data.loc[:, ["Age", "Spending Score (1-100)"]].values
print(x.shape)
The following code imports the KMeans class from the scikit-learn library. Then, we can use the elbow method to find the optimal number of clusters.
from sklearn.cluster import KMeans

wcss = []
for i in range(1, 11):
    km = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    km.fit(x)
    wcss.append(km.inertia_)

plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method', fontsize=20)
plt.xlabel('No. of Clusters')
plt.ylabel('wcss')
plt.show()
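The quantity plotted on the vertical axis, the within-cluster sum of squares (WCSS), is what scikit-learn stores in the "inertia_" attribute: the sum of squared distances between each point and the center of the cluster it is assigned to. As a minimal sketch, assuming invented two-feature data with two well-separated groups, the value can be recomputed by hand and compared with "inertia_".

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical two-feature data: two well-separated groups (values are invented).
points = np.vstack([
    rng.normal(loc=[25, 80], scale=3, size=(30, 2)),
    rng.normal(loc=[55, 20], scale=3, size=(30, 2)),
])

km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(points)

# WCSS: squared distance from each point to the center of its assigned cluster, summed up.
assigned_centers = km.cluster_centers_[km.labels_]
manual_wcss = float(((points - assigned_centers) ** 2).sum())

print(manual_wcss, km.inertia_)
```

Because adding clusters can only bring points closer to their nearest center, the WCSS tends to fall as the number of clusters grows, which is why the elbow method looks for the point where this decrease slows down rather than for the minimum.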
The optimal number of clusters based on customer age and spending score is four. The following code visualizes the data using KMeans clustering.
kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=10, random_state=0)
ymeans = kmeans.fit_predict(x)

plt.rcParams['figure.figsize'] = (10, 10)
plt.title('Cluster of Ages', fontsize=30)
plt.scatter(x[ymeans == 0, 0], x[ymeans == 0, 1], s=100, c='pink', label='Usual Customers')
plt.scatter(x[ymeans == 1, 0], x[ymeans == 1, 1], s=100, c='orange', label='Priority Customers')
plt.scatter(x[ymeans == 2, 0], x[ymeans == 2, 1], s=100, c='lightgreen', label='Target Customers(Young)')
plt.scatter(x[ymeans == 3, 0], x[ymeans == 3, 1], s=100, c='red', label='Target Customers(Old)')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=50, c='black')
plt.xlabel('Age')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.grid()
plt.show()
We can repeat the same process for another cluster analysis. This time, we can consider annual income and spending score as important features.
First, we select the two features for this analysis. Then, we use the elbow method again to find the optimal number of clusters.
x = market_data.loc[:, ["Annual Income", "Spending Score (1-100)"]].values
print(x.shape)
Finding the optimal number of clusters
from sklearn.cluster import KMeans

plt.figure(figsize=(22, 8))
wcss = []
for i in range(1, 11):
    km = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    km.fit(x)
    wcss.append(km.inertia_)

plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method', fontsize=20)
plt.xlabel('No. of Clusters')
plt.ylabel('wcss')
plt.show()
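Reading the elbow from the chart is a visual judgment. One simple way to support that judgment, offered here only as a rough heuristic and not as part of the original assignment, is to look at the second difference of the WCSS values: it is largest where the decrease slows down most sharply. The WCSS values below are invented for illustration.

```python
import numpy as np

# Hypothetical WCSS values for k = 1..8 (invented for illustration).
ks = np.arange(1, 9)
wcss = np.array([1000.0, 700.0, 420.0, 150.0, 130.0, 115.0, 105.0, 100.0])

# Second difference at each interior k: wcss[k-1] - 2 * wcss[k] + wcss[k+1].
# A large value means the curve bends sharply at that k.
second_diff = wcss[:-2] - 2 * wcss[1:-1] + wcss[2:]

# Interior values of k are 2..7; pick the one with the sharpest bend.
elbow_k = int(ks[1:-1][np.argmax(second_diff)])

print(elbow_k)
```

With real data the curve is often smoother than in this invented example, so the numerical result should be read together with the chart rather than instead of it.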
The optimal number of clusters based on customer annual income and spending score is five. The following code visualizes the data using KMeans clustering.
km = KMeans(n_clusters=5, init='k-means++', max_iter=300, n_init=10, random_state=0)
y_means = km.fit_predict(x)

plt.figure(figsize=(22, 8))
plt.scatter(x[y_means == 0, 0], x[y_means == 0, 1], s=100, c='pink', label='Cluster 1')
plt.scatter(x[y_means == 1, 0], x[y_means == 1, 1], s=100, c='yellow', label='Cluster 2')
plt.scatter(x[y_means == 2, 0], x[y_means == 2, 1], s=100, c='cyan', label='Cluster 3')
plt.scatter(x[y_means == 3, 0], x[y_means == 3, 1], s=100, c='magenta', label='Cluster 4')
plt.scatter(x[y_means == 4, 0], x[y_means == 4, 1], s=100, c='orange', label='Cluster 5')
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], s=50, c='blue', label='Center')
plt.style.use('fivethirtyeight')
plt.title('K Means Clustering', fontsize=20)
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.legend()
plt.grid()
plt.show()
<bold>Step 5</bold>. Observations and inferences
We have identified the number of clusters. The first visualization helps us draw conclusions based on the customer age and spending score for each cluster.
This cluster analysis gives us a clear insight into the different segments of customers in the marketing dataset. There are four segments of customers based on their age and spending score, which are appropriate attributes for determining the customer segments.
The second visualization helps us identify customer segments based on the annual income and spending score of each customer. By considering these variables, five segments have been identified in the second cluster analysis.
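To describe each segment in words, it can help to attach the cluster labels to the table and summarize every cluster by its size and average values. The following is a minimal sketch on invented data with two obvious groups; the column names follow this assignment, while the names "demo" and "profile" are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = ["Annual Income", "Spending Score (1-100)"]

# Hypothetical customers: two invented, well-separated groups of 40 customers each.
demo = pd.DataFrame(
    np.vstack([
        rng.normal(loc=[30, 80], scale=4, size=(40, 2)),
        rng.normal(loc=[90, 20], scale=4, size=(40, 2)),
    ]),
    columns=features,
)

km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
demo["Cluster"] = km.fit_predict(demo[features].values)

# One row per cluster: number of customers and average value of each feature.
profile = demo.groupby("Cluster").agg(
    customers=("Annual Income", "size"),
    mean_income=("Annual Income", "mean"),
    mean_score=("Spending Score (1-100)", "mean"),
)

print(profile.round(1))
```

A table of this kind makes it easier to give each segment a meaningful label, such as the ones used in the legend of the first visualization.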
<bold>Note</bold>: The official scikit-learn documentation for KMeans can be found at https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
Footnotes
https://www.hec.ca/en/programs/masters/master-data-science-business-analytics/.
https://www.mcgill.ca/desautels/programs/mma.
https://www.essec.edu/en/program/mscs/master-data-sciences-business-analytics/.
Anaconda Python Distribution is available for download at www.anaconda.com.
For future assignments, additional sample datasets can be downloaded from www.kaggle.com.
Sample sales dataset can be downloaded from https://www.mediafire.com/file/0s40unyg0q0jbca/Marketing%5fDataset.csv.
By Aria Teimourzadeh; Samaneh Kakavand and Benjamin Kakavand