How to stop worrying and love your Postgres indexes

In this article we will explore how to get the most out of indexes in our applications and how to fully exploit their immense power. Hopefully, after reading this you’ll be a member of the Sacred Order of the SQL Indexing Gurus™.

I feel that this comic is somehow relevant.

XKCD 208

O Index, Who Art Thou?

So, what’s an index in the first place? An index makes our queries really fast. No wait, scrap that. An index makes our queries really, really fast.

An index will require its own disk space and will help the database access the indexed data in a more structured way. You can think of an index as if it were a book index: when we are looking for a certain topic, we usually don’t read the whole book from beginning to end. Instead, we reach for the index, quickly scan through its pages and get the page where we can find the information we’re looking for.

In the case of our index, we have a book which is constantly changing. So, the database always has to keep the index updated in order to use it to make our queries faster. We all know there is no such thing as a free lunch: keeping indexes up to date is an expensive operation, and adding too many useless indexes to your tables will make every INSERT, UPDATE and DELETE operation very slow.
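As a side note, if you suspect a table has accumulated dead weight, the pg_stat_user_indexes statistics view can reveal indexes that are never used. A minimal sketch (the counters only reflect activity since statistics were last reset, so take the results with a grain of salt):

```sql
-- Indexes that have never been used for a scan are candidates for removal.
SELECT relname      AS table_name,
       indexrelname AS index_name,
       idx_scan     AS scans
  FROM pg_stat_user_indexes
 WHERE idx_scan = 0
 ORDER BY relname, indexrelname;
```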

Let's Get Started

So, let’s create a table first:

CREATE TABLE projects (  
  id bigserial primary key,
  name text,
  priority varchar(20) NOT NULL,
  created_at timestamp
);

Then, we can create a function to generate random strings:

CREATE FUNCTION random_string(minlen NUMERIC, maxlen NUMERIC)
RETURNS VARCHAR(1000)
AS
$$
DECLARE
  rv  VARCHAR(1000) := '';
  i   INTEGER := 0;
  len INTEGER := 0;
BEGIN
  -- Return an empty string for nonsensical bounds
  IF maxlen < 1 OR minlen < 1 OR maxlen < minlen THEN
    RETURN rv;
  END IF;

  -- Pick a random length between minlen and maxlen
  len := floor(random() * (maxlen - minlen)) + minlen;

  -- Append len random lowercase letters (a-z)
  FOR i IN 1..len LOOP
    rv := rv || chr(97 + CAST(random() * 25 AS INTEGER));
  END LOOP;
  RETURN rv;
END;
$$ LANGUAGE plpgsql;
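A quick sanity check (your output will differ, since the string is random):

```sql
SELECT random_string(10, 12);
```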

Now let's insert some data inside our table:

INSERT INTO projects (  
  name, priority, created_at
) SELECT initcap(lower(random_string(10, 12))),
         ('[0:3]={low, medium, high, urgent}'::text[])[trunc(random()*4)],
         date
  FROM generate_series('2015-01-01'::timestamp, '2015-01-31', '5 seconds') date;

You should have added quite a lot of projects :)

SELECT COUNT(*) FROM projects;  
 count
--------
 518401
(1 row)

Let's take a look at what the rows look like:

SELECT * FROM projects LIMIT 10;  
 id |    name     | priority |     created_at
----+-------------+----------+---------------------
  1 | Dnlcplzeaa  | low      | 2015-01-01 00:00:00
  2 | Osclffksyq  | urgent   | 2015-01-01 00:00:05
  3 | Jxljdoogdyi | high     | 2015-01-01 00:00:10
  4 | Ivliyvwrxi  | urgent   | 2015-01-01 00:00:15
  5 | Sqbjkpsibd  | medium   | 2015-01-01 00:00:20
  6 | Qrsqalbihmp | medium   | 2015-01-01 00:00:25
  7 | Myhuwjnnlw  | urgent   | 2015-01-01 00:00:30
  8 | Pgcssrjkka  | medium   | 2015-01-01 00:00:35
  9 | Crdohjxtieo | low      | 2015-01-01 00:00:40
 10 | Bsyxqznwcgp | urgent   | 2015-01-01 00:00:45
(10 rows)

Awesome. Now let's give one of those projects a name we can easily search for, which we will use to test our indexes:

UPDATE projects  
   SET name='Manhattan'
 WHERE id=1337;

A Simple Index

Let's start by finding our project:

SELECT * FROM projects WHERE name = 'Manhattan';  

Now, if we explain the query here is what we see:

EXPLAIN SELECT * FROM projects WHERE name = 'Manhattan';  
                           QUERY PLAN
----------------------------------------------------------------
 Seq Scan on projects  (cost=0.00..7246.75 rows=1242 width=106)
   Filter: (name = 'Manhattan'::text)
(2 rows)

I admit it is quite intimidating to read Postgres EXPLAIN output for the first time. Here's a summary:

  • Type of table scan: Seq Scan
  • Estimated initial cost: 0.00
  • Estimated final cost: 7246.75
  • Estimated result rows: 1242
  • Estimated rows width: 106

Seq Scan stands for Sequential Scan: Postgres is doing a full scan of the table, reading every single row and checking whether its name is 'Manhattan'.
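Note that plain EXPLAIN only shows the planner's estimates. If you want those guesses checked against reality, EXPLAIN ANALYZE actually runs the query and reports real row counts and timings. Be careful: it executes the statement, so wrap data-modifying queries in a transaction you can roll back.

```sql
-- Runs the query for real and adds "actual time" and "actual rows" to the plan
EXPLAIN ANALYZE SELECT * FROM projects WHERE name = 'Manhattan';
```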

Let's try adding a very simple index on name:

CREATE INDEX projects_name ON projects(name);  
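As mentioned earlier, the index takes up disk space of its own. If you're curious how much, the standard pg_relation_size and pg_size_pretty functions will tell you:

```sql
-- Show the on-disk size of the index in human-readable form
SELECT pg_size_pretty(pg_relation_size('projects_name'));
```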

Now let's run the EXPLAIN again:

EXPLAIN SELECT * FROM projects WHERE name = 'Manhattan';  
                            QUERY PLAN
------------------------------------------------------------------
 Index Scan using projects_name on projects  (cost=0.42..8.44 rows=1 width=32)
   Index Cond: (name = 'Manhattan'::text)
(2 rows)

Instead of reading the whole table, Postgres uses our projects_name index, searches for our project name in the index, then loads only the row we are interested in. To give some perspective on the performance improvement:

Before

Time: 56.993 ms  

After

Time: 0.403 ms  

Which is 141 times faster. Whoah!

Another Experiment

Let's try with a very similar query:

SELECT * FROM projects WHERE priority = 'low';  

EXPLAIN gives a similar result:

EXPLAIN SELECT * FROM projects WHERE priority = 'low';  
                            QUERY PLAN
------------------------------------------------------------------
 Seq Scan on projects  (cost=0.00..10621.01 rows=130620 width=32)
   Filter: ((priority)::text = 'low'::text)
(2 rows)

Seq Scan, we meet again. But let's try adding an index on the priority field:

CREATE INDEX projects_priority ON projects(priority);  

And let's see what EXPLAIN tells us now:

EXPLAIN SELECT * FROM projects WHERE priority = 'low';  
                            QUERY PLAN
------------------------------------------------------------------
 Bitmap Heap Scan on projects  (cost=2468.09..8254.72 rows=131570 width=32)
   Recheck Cond: ((priority)::text = 'low'::text)
   ->  Bitmap Index Scan on projects_priority  (cost=0.00..2435.20 rows=131570 width=0)
         Index Cond: ((priority)::text = 'low'::text)
(4 rows)

Amazing, we have a new type of scan! A Bitmap Index Scan followed by a Bitmap Heap Scan. The best explanation I could find comes from this SO answer:

An index scan will go through rows one by one, opening disk pages again and again, as many times as necessary.

A bitmap index scan will sequentially open a short-list of disk pages, and grab every applicable row in each one (hence the so-called recheck cond you see in query plans).
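If you ever want to see which alternatives the planner considered, you can temporarily discourage a scan type in the current session. The enable_* settings below are real Postgres parameters, but they are a debugging aid for exploring plans, not something to use in production code:

```sql
SET enable_bitmapscan = off;   -- planner will avoid bitmap scans if it can
EXPLAIN SELECT * FROM projects WHERE priority = 'low';
RESET enable_bitmapscan;       -- back to the default
```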

Let's take a look at the performance improvement:

Before

Time: 206.884 ms  

After

Time: 159.740 ms  

Uh, this is not nearly as impressive as our first index... It's only 1.3 times faster. But if you check the output of the command, you'll see this at the end:

(129265 rows)

Postgres had to fetch a huge amount of records, so that might explain why we haven't improved the performance of that query that much. We can verify our assumption by querying a smaller subset of the rows:

EXPLAIN SELECT * FROM projects WHERE priority = 'low' AND created_at > '2015-01-29';  
                            QUERY PLAN
------------------------------------------------------------------
 Bitmap Heap Scan on projects  (cost=2437.36..8552.91 rows=8638 width=32)
   Recheck Cond: ((priority)::text = 'low'::text)
   Filter: (created_at > '2015-01-29 00:00:00'::timestamp without time zone)
   ->  Bitmap Index Scan on projects_priority  (cost=0.00..2435.20 rows=131570 width=0)
         Index Cond: ((priority)::text = 'low'::text)

Mmm, the estimated final cost of this query is basically the same as our previous query. Let's verify that by actually running the queries:

Before

SELECT * FROM projects WHERE priority = 'low' AND created_at > '2015-01-29';  
[...]
(8712 rows)

Time: 78.440 ms  

After

SELECT * FROM projects WHERE priority = 'low' AND created_at > '2015-01-29';  
[...]
(8712 rows)

Time: 44.885 ms  

Uhh, it seems that our queries aren't getting faster anymore... Why?

The index we added on priority only narrows the search down to roughly 25% of the rows in the table, and it can't help us at all when we filter on the created_at field.

Composite Index to the Rescue!

We can solve our problems by creating a new composite index, which covers both fields we are interested in:

CREATE INDEX projects_priority_created_at ON projects(priority, created_at);

Moar EXPLAIN pliz:

EXPLAIN SELECT * FROM projects WHERE priority = 'low' AND created_at > '2015-01-29';  
                            QUERY PLAN
------------------------------------------------------------------
 Bitmap Heap Scan on projects  (cost=224.96..4496.53 rows=8638 width=32)
   Recheck Cond: (((priority)::text = 'low'::text) AND (created_at > '2015-01-29 00:00:00'::timestamp without time zone))
   ->  Bitmap Index Scan on projects_priority_created_at  (cost=0.00..222.80 rows=8638 width=0)
         Index Cond: (((priority)::text = 'low'::text) AND (created_at > '2015-01-29 00:00:00'::timestamp without time zone))
(4 rows)

That looks much better! But does it actually work?

SELECT * FROM projects WHERE priority = 'low' AND created_at > '2015-01-29';  
[...]
(8712 rows)

Time: 14.283 ms  

Yep. 5 times faster.

The cool thing about our composite index is that Postgres can use it even for queries that filter only on the priority field. To verify this:

DROP INDEX projects_priority;  
EXPLAIN SELECT * FROM projects WHERE priority = 'low';  
                            QUERY PLAN
------------------------------------------------------------------
 Bitmap Heap Scan on projects  (cost=3052.09..8838.72 rows=131570 width=32)
   Recheck Cond: ((priority)::text = 'low'::text)
   ->  Bitmap Index Scan on projects_priority_created_at  (cost=0.00..3019.20 rows=131570 width=0)
         Index Cond: ((priority)::text = 'low'::text)

You can see that Postgres is using our newly created composite index to make the query faster. That feels good, right?

But what about queries which are only filtering the created_at field?

EXPLAIN SELECT * FROM projects WHERE created_at > '2015-01-29';  
                            QUERY PLAN
------------------------------------------------------------------
 Seq Scan on projects  (cost=0.00..10622.01 rows=34035 width=32)
   Filter: (created_at > '2015-01-29 00:00:00'::timestamp without time zone)
(2 rows)

Unfortunately not: we can't use the composite index to filter only by created_at. The same thing would happen if we had a composite index on three fields: we would be able to use it for queries on the first field, for queries on the first two fields, and for queries involving all three fields, but not for any other combination. In short, only a leftmost prefix of the index columns can be used efficiently.
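If queries filtering only on created_at matter to you, the straightforward fix is a dedicated index on that column alone. A sketch (whether the extra write overhead is worth it depends on your workload):

```sql
CREATE INDEX projects_created_at ON projects(created_at);
```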

Puzzle #1

What do you think would happen with this query?

EXPLAIN SELECT * FROM projects WHERE priority IN ('low', 'medium', 'high');  

Take some time to think this through, then keep on reading.

Ready?

Here's the EXPLAIN output:

                            QUERY PLAN
------------------------------------------------------------------
 Seq Scan on projects  (cost=0.00..11270.01 rows=389112 width=32)
   Filter: ((priority)::text = ANY ('{low,medium,high}'::text[]))

WAT. But what about our index?!?

The query planner estimated that this query would return 389112 rows: with that many matches, it wouldn't make sense to look up each entry in the index and then fetch the corresponding rows from the table one by one. Reading the whole table directly is simply faster.
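Where do these estimates come from? Postgres keeps per-column statistics, which you can peek at through the pg_stats view (the exact values depend on when ANALYZE last ran on the table):

```sql
-- The most common values of priority and their estimated frequencies
SELECT most_common_vals, most_common_freqs
  FROM pg_stats
 WHERE tablename = 'projects'
   AND attname = 'priority';
```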

Puzzle #2

Ok, so let's restrict the result set to be more specific:

EXPLAIN SELECT * FROM projects WHERE priority IN ('low', 'medium', 'high') AND created_at BETWEEN '2015-01-29 12:00:00' AND '2015-01-29 12:00:20';  

What do you think will happen now? Keep reading on when you have an answer.

                            QUERY PLAN
------------------------------------------------------------------
 Index Scan using projects_priority_created_at on projects  (cost=0.42..25.05 rows=3 width=32)
   Index Cond: (((priority)::text = ANY ('{low,medium,high}'::text[])) AND (created_at >= '2015-01-29 12:00:00'::timestamp without time zone) AND (created_at <= '2015-01-29 12:00:20'::timestamp without time zone))

Great, so we should be using our composite index to fetch 3 rows in total. Is this accurate?

SELECT * FROM projects WHERE priority IN ('low', 'medium', 'high') AND created_at BETWEEN '2015-01-29 12:00:00' AND '2015-01-29 12:00:20';  
   id   |    name     | priority |     created_at
--------+-------------+----------+---------------------
 492482 | Tlsxtmzprc  | high     | 2015-01-29 12:00:05
 492483 | Xppnqmrjgdo | medium   | 2015-01-29 12:00:10
(2 rows)

Time: 0.597 ms  

Whoah, postgres, u smart.

To reiterate how cool indexes are, here's the performance without our composite index:

SELECT * FROM projects WHERE priority IN ('low', 'medium', 'high') AND created_at BETWEEN '2015-01-29 12:00:00' AND '2015-01-29 12:00:20';  
   id   |    name     | priority |     created_at
--------+-------------+----------+---------------------
 492482 | Tlsxtmzprc  | high     | 2015-01-29 12:00:05
 492483 | Xppnqmrjgdo | medium   | 2015-01-29 12:00:10
(2 rows)

Time: 63.507 ms  

Which is 106 times slower.

Puzzle #3

What do you think will happen with this one?

EXPLAIN SELECT * FROM projects WHERE priority IN ('low', 'medium', 'high') AND created_at BETWEEN '2015-01-29 00:00:00' AND '2015-01-29 01:00:00';  

Let's take a look at the EXPLAIN:

                            QUERY PLAN
------------------------------------------------------------------
 Bitmap Heap Scan on projects  (cost=19.96..1491.53 rows=524 width=32)
   Recheck Cond: (((priority)::text = ANY ('{low,medium,high}'::text[])) AND (created_at >= '2015-01-29 00:00:00'::timestamp without time zone) AND (created_at <= '2015-01-29 01:00:00'::timestamp without time zone))
   ->  Bitmap Index Scan on projects_priority_created_at  (cost=0.00..19.83 rows=524 width=0)
         Index Cond: (((priority)::text = ANY ('{low,medium,high}'::text[])) AND (created_at >= '2015-01-29 00:00:00'::timestamp without time zone) AND (created_at <= '2015-01-29 01:00:00'::timestamp without time zone))

What will happen when we add this composite index?

CREATE INDEX projects_created_at_priority ON projects(created_at, priority);  

Think about it, then keep on reading :)

                            QUERY PLAN
------------------------------------------------------------------
 Index Scan using projects_created_at_priority on projects  (cost=0.42..935.93 rows=524 width=32)
   Index Cond: ((created_at >= '2015-01-29 00:00:00'::timestamp without time zone) AND (created_at <= '2015-01-29 01:00:00'::timestamp without time zone) AND ((priority)::text = ANY ('{low,medium,high}'::text[])))

Wow, by inverting the order of the fields in the composite index, Postgres can now use it to first narrow the result set by created_at, and then filter on the much less selective priority field. And does it actually work?

Before

SELECT * FROM projects WHERE priority IN ('low', 'medium', 'high') AND created_at BETWEEN '2015-01-29 00:00:00' AND '2015-01-29 01:00:00';  
[...]
(557 rows)

Time: 3.786 ms  

After

SELECT * FROM projects WHERE priority IN ('low', 'medium', 'high') AND created_at BETWEEN '2015-01-29 00:00:00' AND '2015-01-29 01:00:00';  
[...]
(557 rows)

Time: 2.245 ms  

The difference is not huge, but it's faster!
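By now we have created quite a few indexes along the way. You can always review what's defined on a table through the pg_indexes view (or with \d projects in psql):

```sql
-- List every index on the projects table, with its definition
SELECT indexname, indexdef
  FROM pg_indexes
 WHERE tablename = 'projects';
```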

The end

Thanks for reading! If you have any questions or corrections, let me know in the comments :)
