Building search functionality in products is a common task. Many solutions exist to solve this problem already. OpenSource tools like opensearch
and meilisearch
are some examples that are very commonly used. Using a 3rd party tool to build a “full-text search” is a good bet if you have a lot of data (i.e. a lot of users).
The goal of this post is to have a look at some in-built tools to build a minimalist search feature when your data is backed by a Postgres (or any SQL) database. Here is a rundown of what we will be covering
ILIKE
operatorSIMILAR TO
operator- POSIX Regular Expressions
- Fuzzy Searching
- Getting to indexes
- Full-Text Search using
tsvector
andtsquery
- Summary
- Resources
ILIKE
operator
The most basic and simplest approach to implement simple “pattern matching”. The ILIKE operator takes a pattern and returns results while ignoring the case of results.
SELECT * FROM my_table WHERE my_column ILIKE'% my pattern %';
The wildcard %
is used to match any number of characters. Similar to ILIKE is the LIKE
operator which returns results in case-sensitive mode.
This is easy to implement, but queries can easily become long when multiple columns are used which can be a hassle to maintain if you have to replicate a similar approach among multiple tables.
SELECT *
FROM my_table
WHERE my_column ILIKE '% my pattern %'
OR my_column2 ILIKE '% my pattern %'
OR my_column3 ILIKE '% my pattern %'
OR my_column4 ILIKE '% my pattern %'
OR my_column5 ILIKE '% my pattern %'
...
On top of that, custom enum types and dates need to be typecast, because ILIKE
only works on TEXT data type.
SELECT *
FROM my_table
WHERE
to_char(created_at, 'mm/dd/yy') ILIKE '% my pattern %'
OR custom_enum_type_column::text ILIKE '% my pattern %';
SIMILAR TO
operator
This operator lets us use regular expressions
in SQL queries and returns true/false if a match is found.
SELECT *
FROM my_table
WHERE my_column SIMILAR TO 'string%'='t';
Not very commonly used and is generally avoided, since introducing regex patterns in SQL queries is not an ideal solution.
POSIX Regular Expressions
POSIX regular expressions are a better way to match patterns than the ILIKE
and SIMILAR TO
operators. The most common example of regex usage is the grep
tool.
The operator ~
can be used to for POSIX regex matches. Where *
is used of case-insensitive matches & !
represents the logical NOT operation.
'pacenthink' ~ 'p.c' = true
'pacenthink' ~* 'thi' = true
'pacenthink' !~ 'pacenthink' = false
'pacenthink' !~* 'PACEANTHINK' = true
Fuzzy Searching
Trigram pg_trgm
A trigram is a group of three consecutive characters taken from a string.
We can measure the similarity of two strings by counting the number of trigrams they share.
To use the
pg_trgm
extension you will have to first enable it usingCREATE EXTENSION pg_trgm;
Let’s take a quick overview of some common trigram functions in Postgres.
show_trgm
Returns array of all the trigrams in the given string (no actual use, good for debugging).
SELECT show_trgm('pacenthink'); -- output [" p"," pa","ace","cen","ent","hin","ink","nk ","nth","pac","thi"]
similarity
Similarity ranges from 0 (not similar) to 1 (exact match).
SELECT similarity('pacenthink', 'think') -- output 0.30769232
Once you have the extension enabled, you can do similar word searches
SELECT * FROM my_table
WHERE SIMILARITY(my_column, 'word') > 0.3
Levenshtein Matching
The Levenshtein distance is the distance between two words, i.e., the minimum number of single-character edits required for both strings to match.
To use this functionality, we need to enable another extension called
fuzzystrmatch
usingCREATE EXTENSION fuzzystrmatch;
SELECT * FROM my_table WHERE levenshtein(my_column, 'MyStr') < 3
The above query will match all my_column
values which have a Levenshtein distance of less than 3 from the string MyStr
.
Getting to indexes
To index LIKE
& SIMILAR TO
the pg_trgm
module supports two PostgreSQL index types: GIST and GIN.
A
GiST
(Generalized Search Tree) index is lossy, meaning that the index may produce false matches, and it is necessary to check the actual table row to eliminate such false matches. (PostgreSQL does this automatically). Use GIST index when you are storing data like longitudes, latitudes, ip address etc.CREATE INDEX index_name ON my_table USING GIST (my_column);
GIN
indexes are not lossy, but their performance depends logarithmically on the number of unique words.CREATE INDEX index_name ON my_table USING GIN (my_column gin_trgm_ops); -- for multiple columns CREATE INDEX index_name ON my_table USING GIN (my_column1, my_column2 gin_trgm_ops);
When using GIN indexes with
pg_trgm
, Postgres will split the row values into trigrams.
As a rule of thumb, a GIN index is faster to search than a GiST index, but slower to build or update; so GIN is better suited for static data and GiST for often-updated data.
Full-Text Search using tsvector
and tsquery
First, let’s look at the definition (source Wikipedia):
In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection in a full-text database. The full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases.
This introduces a new term, “document” which is the unit of searching in a full-text search system; for example, an article or email message. Note that a “document” is not related to a table but rather to data, meaning we can have a document encompassing multiple columns and tables.
Full-text search in postgres allows for an efficient search for phrases and words within large amounts of text data. Different functions available to Postgres can be used to achieve full-text search:
- to_tsvector for creating a list of tokens (the tsvector data type, where ts stands for “text search”);
- to_tsquery for querying the vector for occurrences of certain words or phrases.
SELECT to_tsvector('The quick brown fox jumped over the lazy dog.');
-- output
'brown':3 'dogs':9 'fox':4 'jumped':5 'lazy':8 'over':6 'quick':2 'the':1,7
Full text searching in PostgreSQL is based on the match operator @@
, which returns true if a tsvector (document) matches a tsquery (query). It doesn’t matter which operator type is written first:
SELECT to_tsvector('The quick brown fox jumped over the lazy dog') @@ to_tsquery('brown & fox');
-- output
TRUE
A tsquery
contains search terms, which must be already-normalized lexemes, and may combine multiple terms using AND, OR, NOT, and FOLLOWED BY operators.
SELECT to_tsquery('fat & rat');
-- output
'fat' & 'rat'
The following shows the difference between casting and using the function to_tsquery()
SELECT 'impossible'::tsquery, to_tsquery('impossible');
-- output
tsquery | to_tsquery
--------------+------------
'impossible' | 'imposs'
Summary
The best way to implement a basic (or even full-text search) for rows in order of thousands is to have a GIN index on columns required used either via tsvector
approach or using the ILIKE
operator.
Note that having any kind of index will slow down INSERTS, UPDATES, and DELETEs. For tables with an extremely high volume of transactions, we have to be very careful about adding indexes. For most tables in most systems, this is not an issue. In those cases opting for a solution like elastic search is a good bet.