7/6/12

GROUP BY vs DISTINCT in SQL

This question has to come my mind many times and also I have been asked the same question many time. That what should be used GROUP BY or DISTINCT, as both perform the same thing.


So, after researching and using both statements and checking the performance using some scripts from here and there, I have come to the conclusion that GROUP BY performs much better then DISTINCT.

Now before, we move on to more discussion, lets first understand what does actually both do.


DISTINCT STATEMENT



The DISTINCT statement when you want only the distinct values from a table where there are multiple duplicate values. Thus, the DISTINCT keyword only returns the distinct values from the table.

SYNTAX

SELECT DISTINCT column_name(s)
from table_name


DISTINCT Example



The "Persons" table:


P_IdLastNameFirstNameAddressCity
1DaveJaySuraj NagarMumbai
2DoctorMayankKhargharMumbai
3AlbaJordiCatalan RoadBarcelona


Now we want to select only the distinct values from the column named "City" from the table above.

We use the following SELECT statement:


SELECT DISTINCT City FROM Persons


The result-set will look like this:


City
Mumbai
Barcelona


GROUP BY STATEMENT


GROUP BY statement is generally used with aggregate functions to group the result_set by one or more columns.


SYNTAX



SELECT column_name, aggregate_function(column_name)
FROM table_name
WHERE column_name operator value
GROUP BY column_name


GROUP BY Example


We have the following "Orders" table:


O_IdOrderDatePriceCustomer
12011/11/121000X
22011/10/231600N
32011/09/20700X
42011/09/21300X
52011/08/232000Y
62011/12/05100N


Now we want to find the total sum (total order) of each customer.

We will have to use the GROUP BY statement to group the customers.

We use the following SQL statement:


SELECT Customer,SUM(Price) FROM Orders
GROUP BY Customer


The result-set will look like this:


CustomerSUM(Price)
X2000
N1700
Y2000


Let's see what happens if we omit the GROUP BY statement:


SELECT Customer,SUM(Price) FROM Orders


The result-set will look like this:


CustomerSUM(Price)
X5700
N5700
X5700
X5700
Y5700
N5700


The result-set above is not what we wanted. The "SUM(Price)" returns a single value, while "Customer" returns 6 values. This will therefore not give us the correct result. However, you have seen that the GROUP BY statement solves this problem.

We can also use the GROUP BY Statement on more than one columns.


CONCLUSION


After running the above queries on a table which had >500,000 records with duplicate data and some semi-additive measures, we come to the conclusion that GROUP BY performs much better than DISTINCT, though both do the same thing but the difference lies in the way in which the execution path is for each of them. 


According, to some, DISTINCT follows a path where it distinct the values at the beginning only where else the GROUP BY statement does the DISTINCT in the end, and also DISTINCT takes up less memory space for running of query, while GROUP BY is made to run the query speedily but it takes more memory.


NOTE :


NESTED GROUP BY STATEMENT, has very bad performance when compared to NESTED DISTINCT. 

Now when the records are less, then we can use DISTINCT, whereas when the  no of records are more then we can use GROUP BY. 

SHARE THIS POST:

0 comments:

Post a Comment