Count words and phrases in a column MySQL/SQL
In this post:
- MySQL Count words in a column per row
- MySQL Count total number of words in a column
- Explanation
- SQL standard version and phrases
- Performance
- Resources
If you want to count phrases or words in MySQL (or SQL) you can use a simple technique like:
SELECT description, LENGTH(description) - LENGTH(REPLACE(description, ' ', '')) + 1 FROM test.city
Let say that we have this table:
name | description |
---|---|
Hong Kong | Hong Kong , officially the Hong Kong Special Administrative Region of the People's Republic of China |
Macau | Macau, officially the Macao Special Administrative Region of the People's Republic of China |
Beijing | Beijing formerly romanized as Peking, is the capital of the People's Republic of China |
MySQL Count words in a column per row
If you want to get the number of words in a column you can do a simple trick like:
- count the length of the column
- count the spaces in column
- extract the first from the second
SELECT description,
LENGTH(description) - LENGTH(REPLACE(description, ' ', '')) + 1
FROM test.city
result:
Hong Kong , officially the Hong Kong Special Administrative Region of the People's Republic of China | 16 |
Macau, officially the Macao Special Administrative Region of the People's Republic of China | 13 |
Beijing formerly romanized as Peking, is the capital of the People's Republic of China | 14 |
the similar result would be for the other column:
SELECT name,
LENGTH(name) - LENGTH(REPLACE(name, ' ', '')) + 1
FROM test.city
result:
Hong Kong | 2 |
Macau | 1 |
Beijing | 1 |
MySQL Count total number of words in a column
You can easily get the number of words in a column by using the previous query and the aggregation function sum:
SELECT description,
sum(LENGTH(description) - LENGTH(REPLACE(description, ' ', '')) + 1)
FROM test.city
result:
43
You can expand the query and calculate for the whole table with Union all(MySQL syntax):
select sum(len) from (
SELECT
LENGTH(description) - LENGTH(REPLACE(description, ' ', '')) + 1 as len
FROM test.city
union all
SELECT
LENGTH(name) - LENGTH(REPLACE(name, ' ', '')) + 1 as len
FROM test.city
) as tablen
result:
47
Explanation
In order to understand how this queries are working we need to divide it into parts. For example counting the words in column name can be done in this way:
SELECT
name,
LENGTH(name),
LENGTH(REPLACE(name, ' ', '')),
LENGTH(name) - LENGTH(REPLACE(name, ' ', '')) + 1
FROM test.city
the result of this query is:
Hong Kong | 9 | 8 | 2 |
Macau | 5 | 5 | 1 |
Beijing | 7 | 7 | 1 |
As you can see we are getting the total length of characters in this column and later we calculate the same but this time without the white-spaces ( as natural separator for words). At the end we are adding 1 for the first word.
This technique can be applied for different values and separators.
SQL standard version and phrases
This is tested in two DB - MySQL and Oracle. The same should work also in Maria DB without changes. For the rest of the DB the syntax flavor could differ a bit but the same logic can be applied.
You can calculate the same way different phrases separated by coma, dots or semicolons.
Performance
This operation can take some time depending on your data, server and number of rows. On a Oracle server for a table with 1 million rows calculating the count for column with length between 2 and 7 it takes 5 seconds to extract the full result set of this operation.
For a table with 2 million rows and same length it took roughly 15 seconds to complete the same query. Similar time is measured in MySQL.