Somehow extracting summary statistics of tables quickly using SQL
I worked with a large table, with a thousand numeric columns, and some of them were dodgy and I wanted to know which. It would help to find basic summary statistics, like the min, max and mean of each column, so let’s do that now.
What to do? Link to heading
I really didn’t want to do this:
select
min(c1),
max(c1),
mean(c1),
min(c2),
max(c2),
mean(c2),
min(c3)
-- and so on for 1000 columns
from xyz
The alternate idea: write SQL, that generates SQL!
What does this mean? Here’s a code snippet, written in PostgreSQL:
SELECT 'select \'spacer\' AS column_name, 0 AS min_value, 0 as avg_value,
0 as max_value, 0 as n_nulls, 0 as n_non_nulls, 0 as n_rows' union all
SELECT
'union all select \'' || "column" || '\', min(' ||
"column" || '), avg(cast(' ||
"column" || ' as float)), max(' ||
"column" || '), sum(case when ' ||
"column" || ' is null then 1 else 0 end) n_nulls, count(1) - sum(case when ' ||
"column" || ' is null then 1 else 0 end) n_non_nulls, count(1) n_rows from ' || tablename
FROM pg_table_def
WHERE tablename = 'XYZ'
and type in ('integer', 'real');
If you run this statement, the output will be a table. Copy-paste the table into a new editor and you’ll see a SQL statement. Run this statement to get your summary statistics! Just remember to modify the WHERE clause to use the table name you are interested in.
How does it work? The statement uses the pg_table_def
table to find the column names in your table. It checks which ones are numeric, then constructs an elaborate string concatenation to generate a SQL query.
No typing required! Once you get the basic idea, you can extend the code above to only include particular metrics that you are interested. Above we find the
- min
- max
- average
- number of nulls
- number of non-nulls
- total number of rows
for each column, but there are plenty more that you could include, like
- number of zeros
- median
- standard deviation / variance
- distribution
- percentiles (5th percentile, 25th percentile, 75th percentile, 95th percentile)
- percentage of outliers (numbers over a certain value
Just be careful if you have large data. The metrics that require sorting to work will be slow (like the median or percentiles) so you might have to do without those (or use a sortkey).
Benefits Link to heading
I’m finding this method of writing SQL very useful. I’ve found four main benefits:
- I type way less SQL
- I make less mistakes in my queries
- I’m way more likely to calculate summary statistics for my tables, which leads to
- being more likely to find errors in my data
This last point is the big one.