Count words and phrases in a column MySQL/SQL

In this post:

If you want to count phrases or words in MySQL (or SQL) you can use a simple technique like:

SELECT description, LENGTH(description) - LENGTH(REPLACE(description, ' ', '')) + 1 FROM test.city

Let say that we have this table:

name description
Hong Kong Hong Kong , officially the Hong Kong Special Administrative Region of the People's Republic of China
Macau Macau, officially the Macao Special Administrative Region of the People's Republic of China
Beijing Beijing formerly romanized as Peking, is the capital of the People's Republic of China

MySQL Count words in a column per row

If you want to get the number of words in a column you can do a simple trick like:

  • count the length of the column
  • count the spaces in column
  • extract the first from the second
SELECT description, 
LENGTH(description) - LENGTH(REPLACE(description, ' ', '')) + 1
FROM test.city

result:

Hong Kong , officially the Hong Kong Special Administrative Region of the People's Republic of China 16
Macau, officially the Macao Special Administrative Region of the People's Republic of China 13
Beijing formerly romanized as Peking, is the capital of the People's Republic of China 14

the similar result would be for the other column:

SELECT name, 
LENGTH(name) - LENGTH(REPLACE(name, ' ', '')) + 1
FROM test.city

result:

Hong Kong 2
Macau 1
Beijing 1

MySQL Count total number of words in a column

You can easily get the number of words in a column by using the previous query and the aggregation function sum:

SELECT description, 
sum(LENGTH(description) - LENGTH(REPLACE(description, ' ', '')) + 1)
FROM test.city

result:

43

You can expand the query and calculate for the whole table with Union all(MySQL syntax):

select sum(len) from (
SELECT  
LENGTH(description) - LENGTH(REPLACE(description, ' ', '')) + 1 as len
FROM test.city
union all
SELECT  
LENGTH(name) - LENGTH(REPLACE(name, ' ', '')) + 1 as len
FROM test.city
) as tablen

result:

47

Explanation

In order to understand how this queries are working we need to divide it into parts. For example counting the words in column name can be done in this way:

SELECT 
	name, 
	LENGTH(name), 
	LENGTH(REPLACE(name, ' ', '')), 
	LENGTH(name) - LENGTH(REPLACE(name, ' ', '')) + 1
FROM test.city

the result of this query is:

Hong Kong 9 8 2
Macau 5 5 1
Beijing 7 7 1

As you can see we are getting the total length of characters in this column and later we calculate the same but this time without the white-spaces ( as natural separator for words). At the end we are adding 1 for the first word.

This technique can be applied for different values and separators.

SQL standard version and phrases

This is tested in two DB - MySQL and Oracle. The same should work also in Maria DB without changes. For the rest of the DB the syntax flavor could differ a bit but the same logic can be applied.

You can calculate the same way different phrases separated by coma, dots or semicolons.

Performance

This operation can take some time depending on your data, server and number of rows. On a Oracle server for a table with 1 million rows calculating the count for column with length between 2 and 7 it takes 5 seconds to extract the full result set of this operation.

For a table with 2 million rows and same length it took roughly 15 seconds to complete the same query. Similar time is measured in MySQL.

Resources

Related Article