The Power Tools For Data Manipulation Series: 11 Practical Formulas For Text Analysis
Data isn’t always neatly structured. In fact, most of it isn’t. Whether you’re working with messy customer feedback, raw survey responses, or inconsistently formatted product catalogs, text analysis can help transform scattered information into something useful.
This is the next installment in our Power Tools for Data Manipulation series, diving into the techniques that make handling text data easier and more efficient. If you’ve ever struggled to extract meaningful insights from a jumble of words or spent hours manually cleaning data, this guide is for you. If you missed the earlier installments, check them out here: Power Query, Advanced Formulas, and Conditional Logic.
We’ll cover how text extraction, pattern matching, concatenation, and advanced formatting methods can simplify everything from cleaning up messy datasets to automating reporting. You’ll also get practical insights into using regular expressions (regex), wildcards, and text-to-columns, turning raw text into structured, decision-ready data.
Let’s break it down.
The role of text analysis in business analytics
Data isn’t just numbers. Much of it comes in text form: customer feedback, email addresses, product descriptions, financial statements, and more. Businesses rely on text data to find insights and automate workflows, but raw text is rarely useful in its original form. That’s where text analysis helps.
Analyzing and structuring text makes it easier to standardize formats, fix inconsistencies, and extract valuable details. From customer sentiment analysis to data cleaning and categorization, text analysis helps teams organize unstructured information, detect patterns, and reduce errors.
Text extraction techniques
Text data often contains valuable details that aren’t immediately accessible. Whether you need to pull product codes from descriptions, extract domain names from email addresses, or separate first and last names, text extraction functions help isolate what matters.
Spreadsheet tools provide several functions to extract specific portions of text based on position, length, or dynamic search criteria. These techniques simplify data cleaning and formatting, making analysis more efficient.
Using LEFT, RIGHT, and MID functions to extract substrings
The LEFT, RIGHT, and MID functions allow you to pull sections of text based on character position.
- LEFT(text, num_chars) extracts characters starting from the left.
- RIGHT(text, num_chars) pulls text from the right side. This is useful for grabbing file extensions or suffixes.
- MID(text, start_num, num_chars) extracts a segment from the middle of a string, starting at a specified position. Perfect for isolating specific segments, like area codes in phone numbers.
For example, if a product ID follows a set format like "ABC-12345," using LEFT(A1,3) returns "ABC," while RIGHT(A1,5) retrieves "12345."
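As an illustrative sketch (not how spreadsheets implement these functions), the same extractions map directly onto Python string slicing:

```python
# Spreadsheet-style substring helpers, sketched in Python for illustration.

def left(text: str, num_chars: int) -> str:
    # LEFT(text, num_chars): take characters from the start.
    return text[:num_chars]

def right(text: str, num_chars: int) -> str:
    # RIGHT(text, num_chars): take characters from the end.
    return text[-num_chars:] if num_chars else ""

def mid(text: str, start_num: int, num_chars: int) -> str:
    # MID is 1-indexed in spreadsheets, so shift the start position.
    return text[start_num - 1:start_num - 1 + num_chars]

product_id = "ABC-12345"
print(left(product_id, 3))    # ABC
print(right(product_id, 5))   # 12345
print(mid(product_id, 5, 3))  # 123
```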
Leveraging FIND and SEARCH functions for dynamic extraction
While LEFT, RIGHT, and MID work with fixed positions, sometimes you need to extract text based on where a pattern appears rather than a fixed character count. That’s where FIND and SEARCH come in. Both functions locate specific text within a string, but there’s a key difference: FIND is case-sensitive, while SEARCH is not.
These functions are often paired with MID or RIGHT to dynamically extract substrings. For instance, extracting a domain from an email address involves using FIND("@",A1) to locate the @ symbol’s position, then applying MID to retrieve everything after it.
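A minimal Python sketch of that FIND-plus-MID pattern, using a made-up address for illustration:

```python
# Extract the domain after "@", mirroring the spreadsheet pattern
# MID(A1, FIND("@", A1) + 1, LEN(A1)).

def email_domain(email: str) -> str:
    at = email.find("@")   # like FIND("@", A1), but 0-indexed
    if at == -1:
        return ""          # no @ found; spreadsheet FIND would return an error
    return email[at + 1:]  # everything after the @

print(email_domain("jane.doe@example.com"))  # example.com
```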
Practical examples of text extraction in business analytics
Text extraction is widely used to clean and structure data for analysis. Some typical applications include:
- Extracting invoice numbers from transaction records. Businesses often store invoice details in long text fields that include dates, client names, and payment statuses. Using text extraction functions helps isolate just the invoice number for reconciliation and reporting.
- Identifying error codes from system logs. IT teams analyze logs containing error messages, timestamps, and device IDs. Extracting specific error codes makes diagnosing and addressing system issues easier.
- Retrieving SKU categories from product descriptions. When product names include embedded details (e.g., “Winter Jacket - XL - Blue”), extracting the category (“Winter Jacket”) helps streamline inventory classification.
Text extraction might seem simple, but it’s the foundation of many advanced data manipulation tasks, and mastering these techniques will save you time.
Pattern matching with wildcards
Wildcards are like the Swiss Army knife of text analysis. They allow you to perform flexible searches across datasets, making it easier to filter and manipulate text efficiently.
Understanding wildcards in spreadsheet formulas
Wildcards are special characters that represent unknown or variable text. They act as placeholders in text-based formulas, making it easier to search for patterns rather than exact matches. The three most common wildcard characters are:
- Asterisk (*): Represents any number of characters. For example, COUNTIF(A:A, "app*") counts all entries starting with "app," like “apple,” “application,” or “appetizer,” regardless of what follows.
- Question mark (?): Stands in for a single character. COUNTIF(A:A, "b?ll") matches "ball" and "bill."
- Tilde (~): Escapes special characters when you need to search for an actual asterisk or question mark.
Let’s say you’re analyzing a customer support ticket dataset, and some customers include a question mark (?) in their issue descriptions:
If you use COUNTIF(A:A, "*?"), it does not search for literal question marks. Because both characters are wildcards, the pattern simply matches any text that is at least one character long, which leads to unintended matches.
To count only support tickets that end with a literal question mark, use COUNTIF(A:A, "*~?"). The tilde tells the formula to treat the final ? as a literal character, so it counts entries such as:
"Where is my order?" and "How do I return this?"
(To match a question mark anywhere in the text, not just at the end, use "*~?*" instead.)
Applying this logic to filter or highlight only question-based inquiries in a large dataset helps separate questions from general complaints or statements.
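For readers who prefer code, here is a hypothetical Python translation of spreadsheet-style wildcard patterns, including the tilde escape, into regular expressions. Note that this sketch matches case-sensitively, whereas COUNTIF ignores case:

```python
import re

def wildcard_to_regex(pattern: str) -> str:
    """Translate spreadsheet wildcards (*, ?, ~ escape) into a regex string."""
    out, i = [], 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "~" and i + 1 < len(pattern):
            out.append(re.escape(pattern[i + 1]))  # literal * or ?
            i += 2
            continue
        if ch == "*":
            out.append(".*")   # any number of characters
        elif ch == "?":
            out.append(".")    # exactly one character
        else:
            out.append(re.escape(ch))
        i += 1
    return "^" + "".join(out) + "$"

def countif(values, pattern):
    rx = re.compile(wildcard_to_regex(pattern))
    return sum(1 for v in values if rx.match(v))

tickets = ["Where is my order?", "Package damaged", "How do I return this?"]
print(countif(tickets, "*~?"))  # 2: only texts ending in a literal "?"
```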
Applying wildcards in filtering and lookup functions
Wildcards are particularly useful in functions like COUNTIF, SUMIF, and VLOOKUP when working with datasets that have minor naming-convention inconsistencies. Here’s how:
- Using COUNTIF to identify partial matches: If a customer feedback dataset includes phrases like "Order delayed," "Order issue," and "Order lost," the formula COUNTIF(A:A, "Order*") counts all rows that start with "Order," regardless of what follows.
- Using SUMIF with wildcards: If a company's expense report lists variations like "Office Supplies - Ink" and "Office Supplies - Paper," SUMIF(A:A, "Office Supplies*", B:B) calculates the total for all expenses under "Office Supplies", regardless of the specific item.
- Enhancing VLOOKUP searches: Instead of requiring an exact match, searching for employee names with slight variations such as "Smith, John" vs. "Smith, John A." can be handled with: VLOOKUP("Smith, John*", A:B, 2, FALSE). This retrieves data for any variation of "Smith, John," ensuring consistency even when middle initials or suffixes are present.
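A rough Python equivalent of the SUMIF example above, with invented expense data; unlike SUMIF, fnmatchcase matches case-sensitively:

```python
import fnmatch

expenses = [
    ("Office Supplies - Ink", 40.0),
    ("Office Supplies - Paper", 25.0),
    ("Travel - Taxi", 18.0),
]

# Like SUMIF(A:A, "Office Supplies*", B:B): sum amounts whose label
# matches the wildcard pattern, regardless of the specific item.
total = sum(amount for label, amount in expenses
            if fnmatch.fnmatchcase(label, "Office Supplies*"))
print(total)  # 65.0
```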
Optimizing pattern recognition for data validation
Wildcards can also streamline data cleaning and validation. For instance, you can use them to identify entries that don’t follow a specific format, such as phone numbers missing hyphens or dates in the wrong order.
They’re also incredibly useful for detecting errors or inconsistencies in large datasets, like spotting typos or misplaced characters. While wildcards might seem like a minor feature, they pack a big punch in text analysis, making them an essential tool for anyone working with messy or unstructured data.
String concatenation methods
Combining text from multiple cells is essential when formatting data, creating structured outputs, or generating reports. Whether you’re building email templates, standardizing product descriptions, or merging customer names, string concatenation helps organize text efficiently.
Using CONCAT and TEXTJOIN to merge text efficiently
Modern spreadsheet tools provide several ways to join text:
- CONCAT: Combines two or more strings into one but doesn’t add spaces or delimiters automatically. For example, CONCAT(A2, B2, C2) combines three cells into one continuous string. Simple but limited.
- TEXTJOIN: A more flexible function that lets you specify a delimiter between values. TEXTJOIN(", ", TRUE, A2:C2) merges text with commas while ignoring empty cells. For example, TEXTJOIN can merge a list of names into a single cell, separated by commas.
Both functions reduce the need for repetitive formulas, making it easier to handle large datasets without manually adding separators.
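A small Python sketch of TEXTJOIN’s behavior (a delimiter plus optional blank-skipping), for illustration only:

```python
def textjoin(delimiter, ignore_empty, *values):
    """Sketch of TEXTJOIN: join values with a delimiter, optionally skipping blanks."""
    items = [str(v) for v in values]
    if ignore_empty:
        items = [v for v in items if v != ""]
    return delimiter.join(items)

# TEXTJOIN(", ", TRUE, ...): blanks are skipped, so no awkward double commas.
print(textjoin(", ", True, "Alice", "", "Bob", "Carol"))  # Alice, Bob, Carol
```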
Formatting concatenated strings for readability
Concatenated text can quickly become difficult to read if not properly formatted. One of the best practices for improving readability is adding spaces or delimiters between merged values. For example, instead of displaying a name as "JohnDoe", using TEXTJOIN(" ", TRUE, A2:B2) ensures the name appears as "John Doe", making it easier to understand.
When working with structured reports, inserting line breaks can also enhance clarity. Adding CHAR(10) within a formula allows text to be separated into multiple lines, making long entries more readable when viewed in applications supporting line wrapping. Additionally, handling missing data is crucial in concatenation. Since empty cells can create awkward gaps, TEXTJOIN automatically skips blank values, ensuring a clean and consistent output.
String concatenation is important in business analytics, particularly in organizing and presenting data. A common use case is report generation, where fields such as date, region, and product type are merged into structured summaries for easier review. Another practical application is email personalization, where customer details like names, order numbers, and support messages are dynamically inserted into automated communications, making them more relevant to recipients.
Concatenation is also essential for dataset standardization, particularly when combining address components into a uniform format for mailing lists or CRM systems. By structuring text effectively, analysts can improve data organization and streamline reporting processes, ensuring information remains clear and actionable.
Regular expressions in spreadsheets
Regular expressions (regex) allow for advanced text pattern matching, making extracting, validating, and cleaning text data easier. Unlike basic search functions, regex provides precise control over identifying and manipulating patterns. This is especially useful when working with large datasets containing varying formats, like customer emails, phone numbers, or invoice codes.
What is regex in spreadsheet and BI apps?
Regex is a sequence of characters that defines a search pattern. It’s supported in many spreadsheet and BI tools, allowing you to perform sophisticated text manipulations. Unlike wildcard searches, which focus on broad matching, regex can define highly specific rules for text extraction. Some common regex elements include:
- \d: Matches any digit (0-9). Useful for extracting numbers from text.
- \w+: Matches one or more word characters (letters, digits, and underscores), a rough way to pick out words and key terms in unstructured text.
- [A-Z]{3}: Finds three consecutive uppercase letters, often used for detecting currency codes or product abbreviations.
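A quick Python demonstration of these elements on a made-up sample line:

```python
import re

line = "Invoice 4821 paid in USD on 2024-03-01"

# \d+ pulls runs of digits; [A-Z]{3} finds three-letter uppercase codes;
# \w+ splits out word-character runs (words, numbers, underscores).
print(re.findall(r"\d+", line))       # ['4821', '2024', '03', '01']
print(re.findall(r"[A-Z]{3}", line))  # ['USD']
print(re.findall(r"\w+", line)[:3])   # ['Invoice', '4821', 'paid']
```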
By incorporating regex into spreadsheet formulas or BI tools, analysts can automate text extraction and validation without manually sifting through data.
Common regex functions for text extraction and validation
Regex is particularly effective for extracting specific types of text from large datasets. One common application is extracting emails, where a regex formula can isolate properly formatted email addresses while ignoring incomplete or invalid entries.
This is especially useful when dealing with unstructured data from user submissions or logs. Regex also helps validate phone numbers, ensuring they adhere to a consistent format to reduce entry errors and improve data quality. Additionally, regex can find specific word patterns, allowing analysts to flag mentions of key terms in survey responses, customer feedback, or product reviews.
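A Python sketch of both ideas; the email pattern is deliberately simplified, and the NNN-NNN-NNNN phone format is an assumed convention for illustration, not a universal standard:

```python
import re

# A deliberately simple email pattern; treat this as an illustration,
# not a spec-complete validator.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

text = "Contact ana@example.com or sales@shop.co.uk; 'bob@' is incomplete."
print(EMAIL.findall(text))  # ['ana@example.com', 'sales@shop.co.uk']

# Validate a hypothetical NNN-NNN-NNNN phone format.
PHONE = re.compile(r"^\d{3}-\d{3}-\d{4}$")
print(bool(PHONE.match("555-867-5309")))  # True
print(bool(PHONE.match("5558675309")))    # False
```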
Automating data cleaning with regex
Manual text cleaning is time-consuming and prone to errors, but regex can automate many repetitive tasks. One major benefit is standardizing formats, where inconsistent date styles, address variations, or product codes can be transformed into a uniform structure.
Regex also simplifies removing unwanted characters, such as extra spaces, special symbols, or duplicated words that may interfere with analysis. Furthermore, regex is useful for flagging inconsistencies, helping identify missing fields or mismatched values before running reports or making decisions.
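The same cleanup steps sketched with Python's re.sub, using invented sample text:

```python
import re

raw = "  Acme   Corp. -  Invoice   #0042  "

# Collapse runs of whitespace into single spaces and trim the ends.
clean = re.sub(r"\s+", " ", raw).strip()
print(clean)  # Acme Corp. - Invoice #0042

# Normalize date styles like 03/01/2024 or 03.01.2024 to 03-01-2024
# (assumes all dates share one ordering; a real pipeline would verify that).
print(re.sub(r"(\d{2})[/.](\d{2})[/.](\d{4})", r"\1-\2-\3", "Due 03/01/2024"))
# Due 03-01-2024
```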
When applied effectively, regex simplifies text processing, reducing manual effort and allowing businesses to focus on extracting meaningful insights rather than cleaning messy data. Regex might have a learning curve, but it’s worth the effort for anyone working with text data.
Text-to-columns advanced usage
When working with large datasets, information is often stored in a single column, making it difficult to analyze or format properly. The text-to-columns feature is a powerhouse for breaking down complex text entries into structured components, making data cleaner and easier to work with.
Beyond basic text splitting: advanced applications
Most users apply text-to-columns for simple tasks like splitting names or separating addresses, but the tool can handle much more.
- Using delimiters: Separates text based on commas, spaces, tabs, or custom characters. This is useful for processing CSV files or standardizing multi-part product descriptions.
- Fixed-width settings: Extracts segments of text when values follow a strict character count, such as structured invoice numbers or customer IDs.
- Dynamic text splitting: When combined with formulas, text-to-columns can adapt to changing data structures, ensuring ongoing accuracy.
Combining text-to-columns with formulas for automation
To take text splitting a step further, pairing text-to-columns with functions like LEFT, RIGHT, MID, and FIND can create dynamic, reusable solutions. Instead of applying it manually each time, formulas can automatically extract key text components:
- Separating first and last names dynamically: Using LEFT(A1, FIND(" ", A1)-1) for first names and RIGHT(A1, LEN(A1)-FIND(" ", A1)) for last names.
- Extracting numerical values from mixed data: Applying MID and FIND to pull ID numbers from text strings that combine letters and digits.
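The same splitting logic, sketched in Python for illustration:

```python
# Text-to-columns logic in Python: split on a delimiter. split_name mirrors
# LEFT(A1, FIND(" ", A1)-1) for the first name and the RIGHT twin for the rest.

def split_name(full_name: str):
    first, _, last = full_name.partition(" ")  # split at the first space
    return first, last

print(split_name("Ada Lovelace"))  # ('Ada', 'Lovelace')

# Delimiter-based splitting, as text-to-columns does for multi-part fields:
row = "Winter Jacket - XL - Blue"
print([part.strip() for part in row.split(" - ")])
# ['Winter Jacket', 'XL', 'Blue']
```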
Practical applications of text-to-columns in business analytics
Businesses rely on structured text for various operations, making text-to-columns an essential tool for organizing data efficiently. One common use case is processing CSV files, where large datasets need to be broken down into individual fields for better management and analysis. These files can be difficult to navigate without proper formatting, but text-to-columns ensures that each data point is correctly placed in its designated column.
Another valuable application is standardizing address data, a crucial step for maintaining accurate shipping records and customer databases. Address components such as street names, city, state, and postal codes often exist in a single field, making it challenging to sort or filter by location. Text-to-columns simplifies this process by separating these details into structured fields, improving consistency across records.
Text-to-columns also helps with extracting customer information from email signatures or unstructured contact lists. Analysts can pull key details like company names, departments, and job titles into individual fields, making it easier to segment customer data and personalize outreach efforts.
By leveraging text-to-columns effectively, businesses can reduce manual cleanup, improve data accuracy, and ensure datasets are structured for deeper analysis.
Mastering text analysis tools
Cleaning and structuring text data directly impacts the quality of analysis, reporting, and automation. Whether you're extracting key details, matching patterns, merging text, or splitting it into structured components, mastering these techniques can save time and reduce errors.
By applying text extraction, wildcards, concatenation, regular expressions, and text-to-columns, analysts can transform raw text into a format that supports better decision-making. These tools help streamline workflows, minimize manual cleanup, and ensure consistency across datasets. As businesses rely more on text-heavy data, knowing how to manipulate and refine it gives analysts an edge. The better your text data is structured, the easier it is to extract meaningful insights.
Text analysis commands: Frequently asked questions
How do I extract specific text from a string in a spreadsheet?
You can use functions like LEFT, RIGHT, MID, FIND, and SEARCH to pull specific portions of text.
What is the difference between FIND and SEARCH functions?
Both functions locate text within a string, but FIND is case-sensitive, while SEARCH is not. If you need an exact match, use FIND. Use SEARCH if you want a more flexible search that ignores case differences.
When should I use wildcards instead of regular expressions?
Wildcards are useful for quick, simple pattern matching in spreadsheet formulas like COUNTIF and VLOOKUP, while regular expressions allow more advanced, precise text extraction and validation. Regex is usually the better option if your task involves structured formats, such as phone numbers or invoice codes.
How can I concatenate text dynamically in a dataset?
Use TEXTJOIN to merge text from multiple cells while adding spaces or other separators and skipping any empty cells.