Learn how Neon's autoscaling works - it estimates Postgres' working set size and keeps it in memory. Engineering post here
Postgres guides/Functions/String functions

Postgres regexp_match() function

Extract substrings matching a regular expression pattern

The Postgres regexp_match() function is used to extract substrings that match a regular expression pattern from a given string. It returns an array of matching substrings, including capture groups if specified in the pattern.

This function is particularly useful for complex string parsing tasks, such as extracting structured information from semi-structured text data. For example, it can be used to parse log files, extract specific components from URLs, or analyze text data for specific patterns.

Try it on Neon!

Neon is Serverless Postgres built for the cloud. Explore Postgres features and functions in our user-friendly SQL editor. Sign up for a free account to get started.

Sign Up

Function signature

The regexp_match() function has the following form:

regexp_match(string text, pattern text [, flags text]) -> text[]
  • string: The input string to search for matches.
  • pattern: A POSIX regular expression pattern to match against the string.
  • flags (optional): A string of one or more single-letter flags that modify how the regular expression is interpreted.

The function returns an array of text values, where each element corresponds to a substring within the first match of the pattern in the input string. If there are no matches, the function returns NULL. If there are no capture groups in the pattern, the array contains a single element with the full match.

Example usage

Consider a table log_entries with a log_text column containing log messages. We can use regexp_match() to extract specific information from these logs.

WITH log_entries AS (
  SELECT '[2024-03-04 10:15:30] INFO: User john_doe logged in from 192.168.1.100' AS log_text
  UNION ALL
  SELECT '[2024-03-04 10:20:45] ERROR: Failed login attempt for user jane_smith from 10.0.0.50' AS log_text
  UNION ALL
  SELECT '[2024-03-04 10:25:55] INFO: User admin logged out' AS log_text
)
SELECT
  regexp_match(log_text, '\[(.*?)\] (\w+): (.*)$') AS parsed_log
FROM log_entries;

This query extracts the timestamp, log level, and message from each log entry. The regular expression pattern \[(.*?)\] (\w+): (.*)$ captures three groups:

  1. The timestamp between square brackets
  2. The log level (INFO, ERROR, etc.), which is alphabetical and terminated with a colon
  3. The rest of the message
parsed_log
-----------------------------------------------------------------------------------------
 {"2024-03-04 10:15:30",INFO,"User john_doe logged in from 192.168.1.100"}
 {"2024-03-04 10:20:45",ERROR,"Failed login attempt for user jane_smith from 10.0.0.50"}
 {"2024-03-04 10:25:55",INFO,"User admin logged out"}
(3 rows)

Advanced examples

Use regexp_match() with regex flags

The regexp_match() function accepts optional flags to modify how the regular expression is interpreted. Here's an example using the 'i' flag for case-insensitive matching:

WITH user_agents AS (
  SELECT 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36' AS user_agent
  UNION ALL
  SELECT 'Mozilla/5.0 (iPhone; CPU iPhone OS 14_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Mobile/15E148 Safari/604.1' AS user_agent
  UNION ALL
  SELECT 'CHROME/91.0.4472.124' AS user_agent
)
SELECT
  regexp_match(user_agent, '(chrome|safari|firefox|msie|opera)\/[\d\.]+', 'i') AS browser
FROM user_agents;

This query extracts the browser name and version from user agent strings, using case-insensitive matching.

browser
----------
 {Chrome}
 {Safari}
 {CHROME}
(3 rows)

Use regexp_match() in a WHERE clause

You can use regexp_match() in a WHERE clause to filter rows based on a regex pattern:

WITH emails AS (
  SELECT 'john.doe@example.com' AS email
  UNION ALL
  SELECT 'jane.smith@company.co.uk' AS email
  UNION ALL
  SELECT 'support@mydomain.io' AS email
)
SELECT *
FROM emails
WHERE regexp_match(email, '^[^@]+@[^@]+\.(com|org|io)$') IS NOT NULL;

This query selects all rows from the emails table where the email address ends with .com, ``.org, or .io`.

email
----------------------
 john.doe@example.com
 support@mydomain.io
(2 rows)

Additional considerations

Performance implications

Using regexp_match() can be computationally expensive, especially on large datasets or with complex patterns. For better performance:

  1. Use simpler patterns when possible.
  2. Consider using LIKE or SIMILAR TO for simple pattern matching.
  3. If you frequently filter based on regex patterns, consider creating a functional index using the regexp_match() expression.

NULL handling

regexp_match() returns NULL if there's no match or if the input string is NULL. This behavior can be useful in WHERE clauses but may require careful handling in SELECT lists.

Alternative functions

  • regexp_matches(): Returns a set of all matches, useful for extracting multiple occurrences of the pattern in the input string.
  • regexp_replace(): Replaces substrings matching a regex pattern within a specified string.
  • regexp_split_to_array(): Splits a string using a regex pattern as the delimiter and returns the result as an array.
  • substring(): Extracts substrings based on a regex pattern similar to regexp_match(), but only returns the first captured group of the match.

Resources

Last updated on

Was this page helpful?