Data Science Terminology 101: 31 Must-Know Definitions

Introductory Definitions Businesses Need to Know When Embarking on a Data Science Journey

Have you ever searched for something only to be bombarded later with a plethora of ads for related services or products? Or crazier still, have you mentioned a product name aloud and been greeted with an ad for that same service a few hours later? The phenomenon is related to data collection, and businesses are using this unprecedented amount of data from our everyday browsing habits to inform their business decisions.

At the core of this process is data science—a complex, vast field with its own set of lingo that to the untrained ear might as well be Klingon, Dothraki, Alienese, or Elvish. But as with any discipline, becoming familiar with the basic terminology and frequently used terms is a great place to start learning.

In this blog, we’ll explore data science terminology and demystify commonly used words in the field of data science to help you understand the discipline that’s shaping every industry (from finance to retail). That way, you’ll be prepped to uncover the great benefits data science could have for your business.

What is Data Science?

Data science pertains to the multidisciplinary process of finding actionable insights from expansive sets of raw and structured data. To explain more simply, it’s the process of discovering which business questions you should be asking and then working with data to find answers to those questions. Data science experts work with a number of techniques and disciplines to obtain these answers, including computer science, predictive analytics, artificial intelligence (AI), statistics, and machine learning. Data science is also known to some as “advanced analytics” or “predictive analytics” — not to be confused with data analytics, which we’ll get into in just a moment.

Specializations Within Data Science

As you can see, the definition of data science introduces a few more buzzwords: artificial intelligence, machine learning, and predictive analytics. As the use of data becomes more prevalent across industries, it has opened the door for the creation of new disciplines that fall under the data science umbrella. Let’s look at some of those specialized fields:

  • Artificial Intelligence (AI): Asking Alexa “what is artificial intelligence?” is a very meta way to find the definition for this term. Smart assistants like Alexa and Siri are provided with information so that they develop artificial intelligence, a.k.a. the intelligence displayed by machines that is meant to mimic human intelligence. As a field, AI is a computer science discipline that involves creating smart machines capable of completing tasks once only possible by humans.

    To create machines, computers, or robots with AI, algorithms that can predict future events, solve problems, or complete tasks are given to the machine to imbue it with knowledge. Today, AI can be seen in many modern-day applications, from robots used to perform surgeries to self-driving cars to the show recommendations you’re given when logging onto Netflix.
  • Business Intelligence (BI): Any company or business that is leveraging existing and historical data to inform its decisions is employing business intelligence (BI). The goal of employing BI software is to organize and analyze existing data to identify patterns that can help improve strategies and decision-making. Business intelligence software can take raw data and turn it into actionable insights for organizations that have the potential to improve revenues, productivity, and customer retention.

  • Big Data: We can all agree data is a set of facts and statistics. So, it makes sense that “big data” is a very simple way of saying “data that is so massive in scale a new term was required to convey its sheer size and complexity.”

    The easiest way to understand this term is to think of the steadily increasing number of connected devices and their ability to track almost everything, including financial records, social media interactions, search history, documents, multimedia files, stock exchange information, business transactions, and more. As of January 2021, there were 4.66 billion active internet users, and the amount of data we collectively produce continues to hit new highs as the number of devices and users increases. By some estimates, each person using a device is responsible for creating 1.7 MB of data every minute. To put that in perspective, that’s the equivalent of creating a “War and Peace”-sized footprint every 60 seconds.

    That massive amount of information is “big data,” and it’s best understood through the 3 V’s of big data (i.e., volume, velocity, and variety), which speak to the massive amount of data (volume) being produced very quickly (velocity) from an increasing number of sources (variety).

  • Machine Learning (ML): When algorithms are plugged into computers to make sense of a set of data and make predictions based on that information, this is known as machine learning. It’s important to note that ML is based on historical data as opposed to human-created rules. This discipline is seen as an offshoot of artificial intelligence.
  • Data Analytics: Data analytics pertains to the processing of raw data to gain actionable business insights. By using a variety of methods to explore, analyze, and interpret data, companies gain the ability to make more educated business decisions. In many ways, data analytics sounds similar to business intelligence, but to differentiate these terms, think of data analytics as processing data to predict future events, whereas business intelligence uses past data to make decisions in the present.
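
To make the machine learning definition above a little more concrete, here is a minimal Python sketch of the difference between a human-created rule and a rule derived from historical data. Everything in it is an assumption made for illustration: the order records are invented, and the names `human_rule`, `learn_cutoff`, and `learned_rule` are hypothetical. Real ML systems use far richer algorithms, but the principle of letting past data set the rule is the same.

```python
# Hypothetical historical orders: (order amount in dollars, was it fraud?).
# These records are invented purely for illustration.
history = [
    (20, False), (35, False), (50, False), (90, False),
    (400, True), (500, True), (650, True),
]


def human_rule(amount):
    """A hand-written rule: someone guessed that only orders over $1,000 are risky."""
    return amount > 1000


def learn_cutoff(records):
    """Pick the cutoff that misclassifies the fewest historical records."""
    best_cutoff, fewest_errors = None, len(records) + 1
    for cutoff in sorted(amount for amount, _ in records):
        # Predict "fraud" when the amount is at or above the candidate cutoff.
        errors = sum((amount >= cutoff) != is_fraud for amount, is_fraud in records)
        if errors < fewest_errors:
            best_cutoff, fewest_errors = cutoff, errors
    return best_cutoff


# The cutoff comes from the data, not from a person's guess.
learned_cutoff = learn_cutoff(history)


def learned_rule(amount):
    """Flag an order using the cutoff suggested by the historical data."""
    return amount >= learned_cutoff


print("Learned cutoff:", learned_cutoff)
print("Human rule flags a $450 order:", human_rule(450))
print("Learned rule flags a $450 order:", learned_rule(450))
```

In this toy example, the hand-written rule misses a $450 order because its author guessed too high, while the learned rule flags it because similar past orders turned out to be fraudulent.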

The Stages of Data Science

When it comes to data, accuracy is everything, which is why it’s crucial to approach it in a methodical, scientific manner. As a result, you’re likely to see the same stages across many data science efforts. Let’s take a look at some terms that are commonly used when discussing these stages.

  • Workshop: The first step is to build the case for moving your data science operation. This step begins with taking a holistic view of the current state of the system including current use cases, cost, and technical requirements. Organizations should evaluate use cases based on domain and complexity and examine cost based on factors such as licenses and support. This way, there will be a complete understanding of needs and requirements. Finally, in what is typically a two-day workshop, Python and R training kicks off for legacy staff that may only know SAS or other legacy systems.

  • POC (Proof of Concept): Once there is a solid understanding of the current state, the next step is to identify and move a handful of cases that are well defined and capable of delivering meaningful returns when completed. The data migration process begins for one or two of these cases, along with an exploration using Python / R. The goal is for Maven Wave to “teach the team to fish” as the collaboration delivers discovery results. (Note: It’s possible to skip the POC stage and go straight to Pilot if there is a high level of confidence in the proposed solution.)

  • Pilot: As a next step, the stakes are raised as efforts broaden. With a pilot, all new models and use cases execute in the cloud, and templates deploy, making it simple to expand and scale all programs. The goal is to build from strength to strength in an organic manner that is specifically aligned with business needs and objectives. At this stage, the one-offs of POCs are now starting to intertwine and produce dynamic, exponential growth and returns.

  • Projects: The project stage pulls the pieces together as all existing and new use cases move to the new framework. The data science modernization effort starts to mature as multi-tenant hosting, visualization, and dashboards take shape. Data strategy evolves to encompass all distributed data pipelines (e.g., Airflow, Spark) as well as scalable databases (e.g., Snowflake, BigQuery), and governance for users and groups is fleshed out. In all, the goal is to evolve to the point where a mature model lifecycle is achieved.

  • Production: An example of production is prepping models for this space in Vertex AI, SageMaker, or Azure ML. Deployments are handled outside of the platform using a combination of deep learning VMs made by Google/AWS/Azure, infrastructure as code (IaC) in the form of Terraform, containerization via Kubernetes, API hosting via Apigee and/or Pub/Sub, and some overarching CI/CD.

4 Common Types of Analytics

Related to data analytics, machine learning, and business intelligence, you might hear four different types of analytics thrown around to describe the type of analysis being conducted. Here are those four types of analytics broken down: 

  • Descriptive Analytics give an account of something that has happened (e.g., 10 customers made purchases, more than 50 people downloaded our white paper, etc.). Descriptive analytics aids in developing key performance indicators (KPIs) that can track success or failures based on actions being taken.

  • Diagnostic Analytics help to answer questions about why things have happened. They work by supplementing more basic descriptive analytics. Diagnostic analytics dig in deeper to find more information about why something happened (e.g., further investigating KPIs to find why they got better or worse).

  • Predictive Analytics work to answer questions about what is likely to happen by relying on historical data and past trends. By using information that shows a person or object’s behavior, leaders can make predictions on how that behavior might change in the future. This not only allows businesses to make more informed decisions but can also help with analyzing past decisions.

  • Prescriptive Analytics aim to identify specific actions that one should take to reach future targets or goals. What measures can you put into place to get the desired results from your actions in the future?
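
The four types of analytics above can be sketched on one small data set. The Python example below is only an illustration under stated assumptions: the monthly ad spend and white paper download figures are invented, the helper names `pearson` and `fit_line` are hypothetical, and a simple straight-line fit stands in for whatever predictive technique a real team would choose.

```python
from math import sqrt
from statistics import mean

# Hypothetical monthly figures, invented for illustration only.
ad_spend = [10, 12, 9, 15, 18, 20]          # thousands of dollars per month
downloads = [100, 118, 92, 150, 181, 199]   # white paper downloads per month

# 1. Descriptive: what happened?
total_downloads = sum(downloads)
average_downloads = mean(downloads)


# 2. Diagnostic: why did it happen? Check how closely downloads track ad spend.
def pearson(xs, ys):
    """Pearson correlation: +1 means the two series rise and fall together."""
    mx, my = mean(xs), mean(ys)
    covariation = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return covariation / spread


correlation = pearson(ad_spend, downloads)


# 3. Predictive: what is likely to happen? Fit a least-squares straight line.
def fit_line(xs, ys):
    """Return (slope, intercept) of the best-fit line y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


slope, intercept = fit_line(ad_spend, downloads)
forecast_at_22 = slope * 22 + intercept  # expected downloads at $22k of spend

# 4. Prescriptive: what should we do? Solve the line for the spend that hits a goal.
target_downloads = 250
required_spend = (target_downloads - intercept) / slope

print(f"Descriptive:  {total_downloads} downloads, {average_downloads} per month")
print(f"Diagnostic:   correlation with ad spend = {correlation:.3f}")
print(f"Predictive:   about {forecast_at_22:.0f} downloads at $22k of spend")
print(f"Prescriptive: spend about ${required_spend:.1f}k to reach {target_downloads}")
```

Each step builds on the one before it: the description supplies the numbers, the diagnosis suggests a driver, the prediction turns that driver into a forecast, and the prescription works backward from a goal to an action. A strong correlation alone does not prove that ad spend causes downloads, so a real diagnostic effort would dig further.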

Key Data Science Terms Explained

Now that you see the big picture, here are some common phrases you’ll hear in relation to the disciplines above. Specifically, these are terms to know to make further sense of how data science, artificial intelligence, big data processing, and machine learning are made possible.

  • Algorithms: Used heavily in machine learning, algorithms are the sets of instructions given to computers so they can run provided values or data through a defined procedure or formula. Algorithms can be simple, like instructions for a step-by-step process, or more complex, like a procedure that tries to predict future revenue based on data.
  • Structured Data: When your data is organized, formatted, and searchable in a database, you’re dealing with structured data. Think of this as data that could easily be plugged into an Excel sheet because its information easily fits into predefined boxes.
  • Unstructured Data: When your data is imported in its native format and therefore exists in a varying number of formats and sizes with differing information specified, you’re dealing with unstructured data. Think of this as a folder on your computer that includes information in a variety of formats—from PDFs to images to videos to presentations.

  • Data Mining: When presented with a set of data, a computer is able to scan that information and identify related variables that may influence future outcomes. This is known as data mining. This is the core process of data science initiatives and is also known as knowledge discovery in data (KDD).

  • Data Wrangling: Also known as munging, data wrangling is the process of cleaning and organizing data so that it can be more readily available and usable. Sometimes data consulting work, or identifying and organizing all existing data points, streamlines this process and is performed separately from a data scientist’s duties depending on the way the data science team is organized. Remember the ensuing panic of Y2K as the world counted down to the year 2000 unsure of the impact of how computer data was written? That’s an excellent example of why data wrangling is such a big consideration for data scientists.

  • Data Set: This is simply the collection of data that we wish to analyze for a given data science initiative.

  • Data Visualization: Graphics, charts, and maps are all types of data visualization, which is simply the pictorial or visual representation of data points. Visualizing data oftentimes makes it easier to identify trends, patterns, and outliers within a set of data.

  • Modeling: Data models dictate how data is stored and processed. When you have an outline of how data points are connected and how they affect each other, you have a data model.

  • Data Science Model: More complex than simple data modeling, data science models don’t just show the relationship between items, they set out to understand data patterns and make predictions for the future (as a human would). This is achievable when a machine learning algorithm becomes familiar with data and able to predict future events based on what’s known. Let’s quickly explore three basic types of data science models you’ll often come across:

    • Linear Models: Linear models show the relationship between two variables. If you know one of the variables, you can predict the unknown variable due to its relationship with the existing piece of data. The resulting formula can be expressed as a straight line on a graph. For example, when charting height and weight, the data points follow a roughly linear pattern: as height on the X-axis increases, so does weight on the Y-axis.

    • Time Series Models: As its name implies, a time series model plots out data points based on when they happened sequentially in order to predict future behaviors. For example, a time series model could set out to predict how much money a company can make in a quarter based on time-plotted revenues from a previous period. 

    • Industry-Specific Models: These models are aligned with specific business needs and take into consideration common data points by industry, industry regulations, and best practices. One example of this is the Association for Retail Technology Standards’ creation of a Relational Data Model and Data Warehouse Model that are specific to the retail industry.

  • Automation: If you’ve ever heard someone say that robots are taking over jobs, that’s an example of someone with a potential disdain for automation — or the minimization of human interaction to complete tasks. When machines are able to oversee the packaging of cookies in a warehouse or car assembly lines run automatically with the help of robots, those are common examples of automation (i.e., when a process or system operates independently and automatically).

  • Regression: In regression analysis, of which linear modeling is a common form, data scientists set out to understand the relationship between two or more variables. One variable is the outcome you want to explain or predict (your dependent variable), and it is examined against one or more other variables (independent variables) to see how they affect it. A simple example would be examining weight (dependent variable) and how it’s affected by diet, water intake, and exercise (independent variables).

  • Classification: Classification, a form of supervised learning, uses algorithms to sort items into predefined groups based on their characteristics.

  • Clustering: Very similar to classification, clustering also means grouping like objects together. In this case, however, there are no predefined groups. As a form of unsupervised learning, clustering uses algorithms to determine what the relationship could be between objects.
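
The difference between classification and clustering is easier to see side by side. The following Python sketch is a simplified illustration, not a production technique: the customer spend figures and the “occasional” and “loyal” labels are invented, the function names are hypothetical, and a nearest-average classifier and a basic one-dimensional k-means loop stand in for the many algorithms used in practice.

```python
# Hypothetical customers described by one number: annual spend in dollars.
# Classification starts from examples that already carry predefined labels.
labeled_customers = [
    (120, "occasional"), (150, "occasional"), (200, "occasional"),
    (900, "loyal"), (1000, "loyal"), (1100, "loyal"),
]


def train_centroids(examples):
    """Supervised step: compute the average spend for each predefined group."""
    totals = {}
    for value, label in examples:
        running_sum, count = totals.get(label, (0, 0))
        totals[label] = (running_sum + value, count + 1)
    return {label: s / n for label, (s, n) in totals.items()}


def classify(value, centroids):
    """Assign a new customer to the predefined group with the closest average."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))


def cluster(values, k=2, rounds=10):
    """Unsupervised step: a basic 1-D k-means that finds k groups with no labels.

    Assumes k >= 2. Starting centers are spread evenly across the sorted values
    so the result is repeatable.
    """
    ordered = sorted(values)
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    groups = [[] for _ in centers]
    for _ in range(rounds):
        groups = [[] for _ in centers]
        for value in values:
            nearest = min(range(k), key=lambda i: abs(value - centers[i]))
            groups[nearest].append(value)
        # Move each center to the average of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return groups


centroids = train_centroids(labeled_customers)
print("Classification of a $300 customer:", classify(300, centroids))

# Clustering sees only the numbers; it is never told the group names.
unlabeled_spend = [value for value, _ in labeled_customers]
print("Clusters found without labels:", cluster(unlabeled_spend))
```

Note that clustering returns groups without names. Deciding that one group means “loyal” and the other “occasional” is still a human interpretation, whereas classification is handed those names up front.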

Data science is a rapidly developing discipline that’s only predicted to grow in popularity in the coming years as more businesses harness the power of data to inform their business decisions. We hope the definitions above shed some light on data science’s core tenets and applications for businesses.

For further reading on dataset analysis and an alternate take on the four tiers of data analytics, read this thought leadership piece from our very own Brian Ray, Global Machine Learning, AI, and Cognitive Computing Firm Lead.

When you’re ready to get going on your data analysis, Maven Wave is here to help. Our data analytics and machine learning services are available to any enterprise that wants to use analytics to power its business. We employ a team of infrastructure specialists and data experts who can help deploy the underlying IT infrastructure and create and execute an analytics model that will set you up for measurable success. Contact us to get started today.

January 27th, 2022