Even More SQL!

Record counts

CTEs (moved to sql2.html)

http://blog.2ndquadrant.com/postgresql-ctes-are-optimization-fences/

OVER

Postgres information_schema and system catalog

Postgres pg_stats and query planning



Record counts

Suppose we want to peek at the first 20 records of a large dataset (say, bigcompany.employee, or passwords.password). We can do this:

select lname from employee limit 20;

What if we want the first 20, in sorted order?

select lname from employee order by lname limit 20;

What if we want the last 20? Here we have to know the total. There are 308 employees in bigcompany:

select count(*) from employee;

So the last 20 of them are

select lname from employee offset 288;

But that's too much math. How about this:

select lname from employee offset ((select count(*) from employee) -20);
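One alternative (a sketch, not from the notes) avoids counting entirely: take the first 20 in descending order, then flip them back with an outer query:

```sql
-- last 20 lnames without a count: top 20 descending, re-sorted ascending
select lname from
  (select lname from employee order by lname desc limit 20) as last20
order by lname;
```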



Random data

Suppose we want some random data. First we try this:

select generate_series(1,20) as id, 1000000*random();

Now let's store this in a table:

create table randos (
    id integer primary key,
    value integer
);

To populate the table, use

insert into randos select generate_series(1,20) as id, 1000000*random();

Note that the values are now integers, because we declared the value column integer in create table; the floating-point results of random() are rounded on insertion.

If when creating the table we declared id to be char(5), the insert still works: Postgres applies an assignment cast (via I/O conversion) from integer to the character type. The series does have to fit in five characters, though:

create table randostr (
    id char(5) primary key,
    value integer
);

insert into randostr select generate_series(10000, 10200) as id, 1000000*random();
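As an aside (a sketch, not from the notes), random() can be made reproducible with setseed(), which is handy when you want the same "random" data twice:

```sql
-- seed the generator; subsequent random() calls in this session
-- are now deterministic
select setseed(0.42);
select generate_series(1,5) as id, 1000000*random();
```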


OVER and window queries

Sometimes we'd like to use aggregation functions but with more flexibility than provided by GROUP BY. Recall that GROUP BY runs the aggregation functions over each group. Let's warm up with some examples:

select sum(salary) from employee;

select dno, salary from employee;

select dno, sum(salary) from employee group by dno;

This one fails, because salary is neither aggregated nor listed in a group-by clause:  select dno, salary, sum(salary) from employee;

But this works:

select dno, salary, sum(salary) over() from employee;

What is it doing? Applying the sum(salary) function to all salaries. This isn't necessarily what we're looking for. But how about this:

select dno, lname, salary, sum(salary) over(partition by dno) from employee;

Now, we have a row for each employee, but the last column is the sum of that employee's department's salaries. This cannot be done with group by (at least not without an inner query), because if we group by dno, we can't have an output record for each individual employee.

The partition clause, inside the over(), determines what we're taking the sum of. We find the partition bucket that the current record belongs to (that is, we find the employee's dno), and take the sum over all salaries in that bucket.

There's also an order by clause, for things within the bucket.

select lname, salary, sum(salary) over (order by lname) from employee;
select lname, salary, sum(salary) over (order by salary desc) from employee;

In each case the sum(salary) is cumulative. The order-by clause determines the order of addition in the cumulative sum.

In the next two examples the rows we're summing are spelled out explicitly; in each case we are doing a cumulative sum. In the first case the sum is cumulative over all employees; in the second it is cumulative by department.

select lname, salary, sum(salary) over (order by lname rows between unbounded preceding and current row) from employee;

select lname, dno, salary, sum(salary) over (partition by dno order by lname rows between unbounded preceding and current row) from employee;

The role of sum() with an over() clause is quite different from its role as an aggregation function. There is a whole set of functions that can be used with over(), known as window functions. Many apply to the entire partition, or (in the presence of an order clause) to the rows from the beginning of the partition through the current record. Here's an example:

select lname, row_number() over(order by lname) from employee;

What happens if we omit the over()? We get an error: row_number() is a window function and cannot be called without an OVER clause.
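row_number() is one of several ranking window functions; rank() and dense_rank() differ in how they handle ties (this example is a sketch, not from the notes):

```sql
-- rank() leaves gaps in the numbering after ties; dense_rank() does not
select lname, salary,
       row_number() over (order by salary desc) as rn,
       rank()       over (order by salary desc) as rk,
       dense_rank() over (order by salary desc) as drk
from employee;
```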

The median

How about finding the median? We can't reference row_number() directly in the where clause, so this attempt fails (there is no column rn in scope when the where clause is evaluated):

select lname, salary, row_number() over (order by salary) as rn from employee where rn = (select count(*) from employee)/2;

But we can use a CTE:

with temp(lname, salary, rn) as (select lname, salary, row_number() over (order by salary) from employee)
select * from temp where rn = (select count(*) from employee)/2;
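Postgres also has a built-in ordered-set aggregate that computes the median directly, interpolating between the two middle values when the count is even (a sketch):

```sql
-- median salary via the ordered-set aggregate percentile_cont
select percentile_cont(0.5) within group (order by salary) from employee;
```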

Some window functions apply to a "window frame", that is, a defined range within the ordered partition. We define this window frame with a rows between ... and ... clause. Here's an example in which each employee is listed with the average salary of that employee, the previous employee, and the following employee:

select lname, salary, avg(salary) over (order by salary rows between 1 preceding and 1 following) from employee;
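Two related window functions, lag() and lead(), fetch single values from neighboring rows instead of aggregating over the frame (a sketch, not from the notes):

```sql
-- each employee alongside the next-lower and next-higher salary
select lname, salary,
       lag(salary)  over (order by salary) as prev_salary,
       lead(salary) over (order by salary) as next_salary
from employee;
```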



Using the Postgres information_schema and system catalogs

The information_schema is more or less standardized; the Postgres system catalog (schema pg_catalog) is Postgres-specific. We can see these schemas with the psql command \dnS. Though we'll refer to the objects in these schemas as tables, many of them are actually views. While reading from them is reasonable, updates to system tables can destroy your database.

These tables can shed light on how Postgres represents tables, columns, users, permissions and planner statistics internally.

For some of these examples, the following psql option is useful:
    \x auto
It prints overly wide tables in a more vertical ("expanded") format. Disable with \x off.

Note that the psql commands, like \d and \d table, get their information from information_schema and pg_catalog queries.

Information about Tables

First we'll use the table information_schema.columns:

select column_name, table_schema, data_type from information_schema.columns where table_name = 'employee';

select * from information_schema.columns where table_name = 'employee' and column_name = 'ssn';

This gives the "normal" columns of table "employee", but does not give the system columns. For those we'll use the pg_class table. We can see the columns of this table with \d pg_class.

select oid, relpages, reltuples FROM pg_class WHERE relname = 'employee';

This is the output (on my system, right now):

  oid  | relpages | reltuples
-------+----------+-----------
 16386 |        1 |         9

The oid (not printed with "select *"!) is the so-called table oid: an internal identifier of the table. The other two columns are estimates of the number of pages and number of tuples.
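As an aside (a sketch, not from the notes), casting a table name to regclass is a quick way to get the oid without querying pg_class by hand:

```sql
-- 'employee'::regclass looks the name up in pg_class;
-- casting the result to oid shows the numeric identifier
select 'employee'::regclass::oid;
```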

Now let's try a query that is comparable to the above information_schema.columns query.

select a.attname, a.atttypid, t.typname
from pg_class c
join pg_attribute a on c.oid = a.attrelid
join pg_type t on t.oid = a.atttypid
where c.relname = 'employee';

Here's the output.

  attname  | atttypid | typname
-----------+----------+---------
 tableoid  |       26 | oid
 cmax      |       29 | cid
 xmax      |       28 | xid
 cmin      |       29 | cid
 xmin      |       28 | xid
 ctid      |       27 | tid
 fname     |     1043 | varchar
 minit     |     1042 | bpchar
 lname     |     1043 | varchar
 ssn       |     1042 | bpchar
 bdate     |     1082 | date
 address   |     1043 | varchar
 sex       |     1042 | bpchar
 salary    |     1700 | numeric
 super_ssn |     1042 | bpchar
 dno       |       23 | int4


Note the first six: these are the hidden system columns (tableoid, cmax, xmax, cmin, xmin, ctid), which do not appear in information_schema.columns.


More from pg_catalog

select datname from pg_database;

select datname, datdba, encoding from pg_database order by datname;

How about the name and owner of databases? This works, if you have DBA privileges:

select d.datname, d.datdba, a.rolname, pg_encoding_to_char(d.encoding) from pg_database d join pg_authid a on d.datdba = a.oid order by d.datname;

But if you're not a privileged user, use pg_roles instead of pg_authid; pg_roles shows the rolpassword field blanked out (the real hashes in pg_authid could be used to crack passwords).

select d.datname, d.datdba, a.rolname, pg_encoding_to_char(d.encoding) from pg_database d join pg_roles a on d.datdba = a.oid order by d.datname;


Information on permission grants

Access to tables is created through specific "grants". Here we view the grants on a table:

select grantee, privilege_type from information_schema.role_table_grants
where table_name='employee';

Here are the grantors and grantees:

select * from information_schema.table_privileges where table_name = 'employee';
select * from information_schema.column_privileges where table_name = 'employee' and column_name = 'ssn';
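There are also convenience functions for checking a single permission; has_table_privilege() is often quicker than scanning the grant tables (a sketch; the user name 'alice' is made up):

```sql
-- does the (hypothetical) user 'alice' have select rights on employee?
select has_table_privilege('alice', 'employee', 'select');
-- the two-argument form checks the current user
select has_table_privilege('employee', 'select');
```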



pg_stats

This is where Postgres keeps its information on where-clause selectivity. Some information: postgresql.org/docs/9.6/static/row-estimation-examples.html.

The pg_stats view is where Postgres exposes the statistics it has gathered to assist in query planning. The histogram_bounds attribute contains a histogram describing the data distribution; if histogram_bounds is null, there is no histogram data available to the query planner! We'll start with the Dellstore database (named "dell" on my machine):

select tablename, attname from pg_stats where schemaname = 'public' and histogram_bounds is null;

  tablename  |       attname       
------------+----------------------
 orderlines | quantity
 orderlines | orderlineid
 products   | special
 products   | price                  <--- not integer
 products   | category
 customers  | gender
 customers  | income
 customers  | age
 customers  | password
 customers  | creditcardexpiration
 customers  | creditcardtype
 customers  | region
 customers  | country
 customers  | state
 customers  | address2

Now let's see which attributes do have histogram data:

select tablename, attname from pg_stats where schemaname = 'public' and histogram_bounds is not null;


 tablename  |    attname    
------------+----------------
 orderlines | orderid
 orderlines | prod_id
 orderlines | orderdate
 inventory  | prod_id
 inventory  | quan_in_stock
 inventory  | sales
 products   | prod_id
 products   | title
 products   | actor
 products   | common_prod_id
 cust_hist  | customerid
 cust_hist  | orderid
 cust_hist  | prod_id
 customers  | customerid
 customers  | firstname
 customers  | lastname
 customers  | address1
 customers  | city
 customers  | zip
 customers  | email
 customers  | phone
 customers  | creditcard
 customers  | username
 orders     | orderid
 orders     | orderdate
 orders     | customerid
 orders     | netamount
 orders     | tax
 orders     | totalamount

Let's look at a couple of these:

select histogram_bounds from pg_stats where tablename = 'orders' and attname = 'totalamount';
select histogram_bounds from pg_stats where tablename = 'customers' and attname = 'zip';
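pg_stats holds more than histogram_bounds; most_common_vals and n_distinct are also consulted by the planner (a sketch):

```sql
-- most common values, their frequencies, and the distinct-value
-- estimate for customers.state
select most_common_vals, most_common_freqs, n_distinct
from pg_stats
where tablename = 'customers' and attname = 'state';
```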

Now let's try some queries on these attributes, with differing levels of selectivity:

explain select * from customers where customerid between 2000 and 2400;    -- index scan
explain select * from customers where customerid between 2000 and 15000;   -- still an index scan
explain select * from customers where customerid between 2000 and 20000;   -- now a sequential scan

The cutoff seems to be somewhere around 15893. [!]

What if we try this with zip:

explain select * from customers where zip between 60000 and 62000;

Selectivity doesn't matter as there's no index on zip. We could add it, though.
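If we did add such an index, the planner could then use it for a selective range query. A sketch (the index name is made up, and we'd need the right to create it):

```sql
-- hypothetical index on zip; afterwards the same explain may show
-- an index scan (or bitmap scan) instead of a sequential scan
create index customers_zip_idx on customers(zip);
explain select * from customers where zip between 60000 and 62000;
```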

Now let's combine two where conditions:

explain select * from customers where zip between 60000 and 62000 and customerid between 2000 and 2400;

We start with an index scan on customerid.

explain select * from customers where firstname between 'P' and 'Q';

We get a sequential scan. There's no index on firstname.

explain select * from customers where zip between 60000 and 62000 and firstname between 'P' and 'Q';

Again, there is no useful index.


Next we run explain analyze. This actually runs the query.

explain analyze select * from customers where zip between 60000 and 62000 and firstname between 'P' and 'Q';

                                                       QUERY PLAN                                                       
-------------------------------------------------------------------------------------------------------------------------
 Seq Scan on customers  (cost=0.00..888.00 rows=9 width=268) (actual time=0.241..3.498 rows=13 loops=1)
   Filter: ((zip >= 60000) AND (zip <= 62000) AND ((firstname)::text >= 'P'::text) AND ((firstname)::text <= 'Q'::text))
   Rows Removed by Filter: 19987
 Planning time: 0.158 ms
 Execution time: 3.520 ms


Note that the estimator thinks there will be 9 rows, but the actual number of rows is 13.


Here's another:

explain analyze select * from customers where zip between 60000 and 70000 and firstname between 'P' and 'T';

                                                       QUERY PLAN                                                       
-------------------------------------------------------------------------------------------------------------------------
  Seq Scan on customers  (cost=0.00..888.00 rows=174 width=268) (actual time=0.579..47.226 rows=185 loops=1)
   Filter: ((zip >= 60000) AND (zip <= 70000) AND ((firstname)::text >= 'P'::text) AND ((firstname)::text <= 'T'::text))
   Rows Removed by Filter: 19815
 Planning time: 0.138 ms
 Execution time: 47.303 ms

The estimator thinks 174 but really there are 185.
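These estimates come from statistics gathered by the analyze command; if they drift too far from the actual row counts, re-running it refreshes pg_stats (a sketch):

```sql
-- recompute the planner statistics for one table
analyze customers;
```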