I also answered "about the same", and no I didn't notice the bookmark lookup (so...

jcampbell1 · on Feb 20, 2014

> It is already a highly selective query.

Is it? The first query is always O(1). The worst case for the latter query is that it must aggregate over 999,910 rows.

Consider the case where all values of 'a' are 123, and all values of 'b' are 42, except in 90 rows.
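To make the arithmetic of that worst case concrete, here is a minimal sketch. It assumes a hypothetical table of 1,000,000 rows; the row layout and the value given to the 90 exceptional rows are invented for illustration and are not from the original question.

```python
# Hypothetical data: 1,000,000 rows, every row has a = 123, and every
# row has b = 42 except for 90 rows (given b = 0 here, an arbitrary choice).
TOTAL_ROWS = 1_000_000
EXCEPTIONS = 90


def rows():
    """Yield (a, b) pairs for the assumed worst-case distribution."""
    for i in range(TOTAL_ROWS):
        b = 0 if i < EXCEPTIONS else 42
        yield (123, b)


# Rows that survive WHERE a = 123 AND b = 42 and must be fed to the aggregate.
matching = sum(1 for a, b in rows() if a == 123 and b == 42)
print(matching)
```

Under these assumptions the filter removes only the 90 exceptional rows, so nearly the whole table reaches the aggregation step.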

pradocchia · on Feb 20, 2014

> Is it?

At least on MSSQL, I would expect a query plan like so:

  1. Index seek WHERE a = 123, yielding ~100 rows. [1]
  2. Bookmark lookup with results from (1), yielding ~100 rows.
  3. Filter (2) WHERE b = 42 and project date_column, yielding ~10 rows.
  4. Aggregate (3) by date_column, yielding ~10 rows or less.

And the optimizer will choose this over a full table scan so long as the cost of steps 1+2+3 is less than the cost of a full table scan. I don't know the threshold for that, but it is certainly more than 100 rows out of 1M+, and the planner will have an estimate of selectivity that will inform plan selection.
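The four plan steps above can be mimicked in a few lines of Python. This is only a toy model under stated assumptions: the table contents are made up, a plain dict stands in for a non-clustered index on `a` (a real engine would traverse a B-tree), and `run_plan` is a hypothetical helper, not anything MSSQL exposes.

```python
from collections import Counter, defaultdict

# Hypothetical base table: row id -> (a, b, date_column).
table = {
    0: (123, 42, "2014-02-19"),
    1: (123, 42, "2014-02-19"),
    2: (123, 7, "2014-02-20"),
    3: (123, 42, "2014-02-20"),
    4: (999, 42, "2014-02-20"),
}

# Hypothetical index on column a: value of a -> list of row ids.
index_on_a = defaultdict(list)
for row_id, (a, _b, _d) in table.items():
    index_on_a[a].append(row_id)


def run_plan(a_value, b_value):
    # 1. Index seek WHERE a = a_value.
    row_ids = index_on_a.get(a_value, [])
    # 2. Bookmark lookup: fetch the full rows from the base table.
    fetched = [table[rid] for rid in row_ids]
    # 3. Filter WHERE b = b_value and project date_column.
    dates = [d for (_a, b, d) in fetched if b == b_value]
    # 4. Aggregate by date_column.
    return dict(Counter(dates))


result = run_plan(123, 42)
print(result)
```

The point of the sketch is that the work in steps 2 through 4 scales with the number of rows the seek returns, not with the size of the table.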

[1] Important caveat. I interpret this line,

> Current situation, selecting about hundred rows out of a million:

...to mean selection of 100 rows from the base table, rather than projection of 100 rows in the result, post-aggregation.

But if he really means a SELECT statement that returns 100 rows, then we have no idea how selective the WHERE clause is, and my answer changes to "The query will be much slower (impact >10%)".

jcampbell1 · on Feb 20, 2014

I took the latter interpretation. The "correct" solution uses the fact that the first query can be solved by selecting 0 rows from the base table.

pdubs · on Feb 20, 2014

No query optimizer would look at this and say "1M rows? Let's group and aggregate before filtering." Not to mention, the question specifically states that a=? would return 100 rows and a=? and b=? would return 10.

Regarding O(1), the first query would be some form of O(n log n) or O(log n) depending on the table/index data structures.

thwarted · on Feb 21, 2014

> No query optimizer would look at this and say "1M rows? Let's group and aggregate before filtering."

I hope no optimizer would say that. It is well defined that filtering, as expressed in the WHERE clause (whether or not it uses indexes), happens before grouping and aggregation, and that grouping happens on the result of the filtering. If the optimizer could choose one way or the other, you'd get different results. If you want to group and aggregate first, you need to express that explicitly with a subquery.
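A small sketch of that difference, using Python's built-in `sqlite3` module rather than MSSQL. The table name `tbl`, its rows, and the `cnt > 1` condition are assumptions made up for illustration; only the column names `a`, `b`, and `date_column` come from the thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (a INTEGER, b INTEGER, date_column TEXT)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)",
    [
        (123, 42, "2014-02-19"),
        (123, 42, "2014-02-19"),
        (123, 7, "2014-02-19"),
        (123, 7, "2014-02-20"),
    ],
)

# WHERE is applied first, so only the b = 42 rows are grouped and counted.
filter_then_group = conn.execute(
    "SELECT date_column, COUNT(*) FROM tbl "
    "WHERE a = 123 AND b = 42 "
    "GROUP BY date_column ORDER BY date_column"
).fetchall()

# Grouping first has to be spelled out with a subquery; the outer WHERE
# then filters the already-aggregated rows.
group_then_filter = conn.execute(
    "SELECT date_column, cnt FROM ("
    "  SELECT date_column, COUNT(*) AS cnt FROM tbl GROUP BY date_column"
    ") WHERE cnt > 1 ORDER BY date_column"
).fetchall()

print(filter_then_group)
print(group_then_filter)
conn.close()
```

The two statements count different sets of rows for the same date, which is why the order cannot be left to the optimizer's discretion.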

jcampbell1 · on Feb 20, 2014

I don't see that a=? rule. By selecting "100" rows out of a million, I think that means 100 distinct dates.

In this case, I take n = 1,000,000

The first query is likely O(log(x)), where x is the number of distinct values of a. I approximated that to be O(1) relative to n.

I could be wrong here, or we could have seen different questions, or we are just interpreting the question differently.

nollidge · on Feb 20, 2014

We're not talking about worst case. We're talking about 100 rows and 10 rows, which is what @pradocchia means by "selective".

Indexing shouldn't follow the theoretical worst case, it should follow what's actually in your table.

ars · on Feb 21, 2014

> The worst case for the latter query is that it must aggregate over 999,910 rows.

No, it already said it only took 100 rows; it can't get worse than that.

Now if he actually meant the final result set was 100 rows (meaning after the group by) that's different. But that's not what he actually said, so the question is misleading.