This question has to come my mind many times and also I have been asked the same question many time. That what should be used GROUP BY or DISTINCT, as both perform the same thing.

So, after researching and using both statements and checking the performance using some scripts from here and there, I have come to the conclusion that GROUP BY performs much better then DISTINCT.

Now before, we move on to more discussion, lets first understand what does actually both do.

DISTINCT STATEMENT

The DISTINCT statement when you want only the distinct values from a table where there are multiple duplicate values. Thus, the DISTINCT keyword only returns the distinct values from the table.

SYNTAX

SELECT DISTINCT column_name(s)
from table_name

DISTINCT Example

The "Persons" table:

P_Id	LastName	FirstName	Address	City
1	Dave	Jay	Suraj Nagar	Mumbai
2	Doctor	Mayank	Kharghar	Mumbai
3	Alba	Jordi	Catalan Road	Barcelona

Now we want to select only the distinct values from the column named "City" from the table above.

We use the following SELECT statement:

SELECT DISTINCT City FROM Persons

The result-set will look like this:

City
Mumbai
Barcelona

GROUP BY STATEMENT

GROUP BY statement is generally used with aggregate functions to group the result_set by one or more columns.

SYNTAX

SELECT column_name, aggregate_function(column_name)
FROM table_name
WHERE column_name operator value
GROUP BY column_name

GROUP BY Example

We have the following "Orders" table:

O_Id	OrderDate	Price	Customer
1	2011/11/12	1000	X
2	2011/10/23	1600	N
3	2011/09/20	700	X
4	2011/09/21	300	X
5	2011/08/23	2000	Y
6	2011/12/05	100	N

Now we want to find the total sum (total order) of each customer.

We will have to use the GROUP BY statement to group the customers.

We use the following SQL statement:

SELECT Customer,SUM(Price) FROM Orders

GROUP BY Customer

The result-set will look like this:

Customer	SUM(Price)
X	2000
N	1700
Y	2000

Let's see what happens if we omit the GROUP BY statement:

SELECT Customer,SUM(Price) FROM Orders

The result-set will look like this:

Customer	SUM(Price)
X	5700
N	5700
X	5700
X	5700
Y	5700
N	5700

The result-set above is not what we wanted. The "SUM(Price)" returns a single value, while "Customer" returns 6 values. This will therefore not give us the correct result. However, you have seen that the GROUP BY statement solves this problem.

We can also use the GROUP BY Statement on more than one columns.

CONCLUSION

After running the above queries on a table which had >500,000 records with duplicate data and some semi-additive measures, we come to the conclusion that GROUP BY performs much better than DISTINCT, though both do the same thing but the difference lies in the way in which the execution path is for each of them.

According, to some, DISTINCT follows a path where it distinct the values at the beginning only where else the GROUP BY statement does the DISTINCT in the end, and also DISTINCT takes up less memory space for running of query, while GROUP BY is made to run the query speedily but it takes more memory.

NOTE :

NESTED GROUP BY STATEMENT, has very bad performance when compared to NESTED DISTINCT.

Now when the records are less, then we can use DISTINCT, whereas when the no of records are more then we can use GROUP BY.

Code 2 Learn

Learn, Code, Share

7/6/12

GROUP BY vs DISTINCT in SQL

DISTINCT STATEMENT

DISTINCT Example

GROUP BY STATEMENT

SYNTAX

GROUP BY Example

CONCLUSION

0 comments:

Post a Comment

Translate

Followers

Popular Posts

Tutorials

Blog Archive