This question has to come my mind many times and also I have been asked the same question many time. That what should be used GROUP BY or DISTINCT, as both perform the same thing.
So, after researching and using both statements and checking the performance using some scripts from here and there, I have come to the conclusion that GROUP BY performs much better then DISTINCT.
Now before, we move on to more discussion, lets first understand what does actually both do.
DISTINCT STATEMENT
The DISTINCT statement when you want only the distinct values from a table where there are multiple duplicate values. Thus, the DISTINCT keyword only returns the distinct values from the table.
SYNTAX
SELECT DISTINCT column_name(s)
from table_name
DISTINCT Example
The "Persons" table:
P_Id | LastName | FirstName | Address | City |
---|---|---|---|---|
1 | Dave | Jay | Suraj Nagar | Mumbai |
2 | Doctor | Mayank | Kharghar | Mumbai |
3 | Alba | Jordi | Catalan Road | Barcelona |
Now we want to select only the distinct values from the column named "City" from the table above.
We use the following SELECT statement:
SELECT DISTINCT City FROM Persons
The result-set will look like this:
City |
---|
Mumbai |
Barcelona |
GROUP BY STATEMENT
GROUP BY statement is generally used with aggregate functions to group the result_set by one or more columns.
SYNTAX
SELECT column_name, aggregate_function(column_name)
FROM table_name
WHERE column_name operator value
GROUP BY column_name
GROUP BY Example
We have the following "Orders" table:
O_Id | OrderDate | Price | Customer |
---|---|---|---|
1 | 2011/11/12 | 1000 | X |
2 | 2011/10/23 | 1600 | N |
3 | 2011/09/20 | 700 | X |
4 | 2011/09/21 | 300 | X |
5 | 2011/08/23 | 2000 | Y |
6 | 2011/12/05 | 100 | N |
Now we want to find the total sum (total order) of each customer.
We will have to use the GROUP BY statement to group the customers.
We use the following SQL statement:
SELECT Customer,SUM(Price) FROM Orders
GROUP BY Customer
The result-set will look like this:
Customer | SUM(Price) |
---|---|
X | 2000 |
N | 1700 |
Y | 2000 |
Let's see what happens if we omit the GROUP BY statement:
SELECT Customer,SUM(Price) FROM Orders
The result-set will look like this:
Customer | SUM(Price) |
---|---|
X | 5700 |
N | 5700 |
X | 5700 |
X | 5700 |
Y | 5700 |
N | 5700 |
The result-set above is not what we wanted. The "SUM(Price)" returns a single value, while "Customer" returns 6 values. This will therefore not give us the correct result. However, you have seen that the GROUP BY statement solves this problem.
We can also use the GROUP BY Statement on more than one columns.
CONCLUSION
After running the above queries on a table which had >500,000 records with duplicate data and some semi-additive measures, we come to the conclusion that GROUP BY performs much better than DISTINCT, though both do the same thing but the difference lies in the way in which the execution path is for each of them.
According, to some, DISTINCT follows a path where it distinct the values at the beginning only where else the GROUP BY statement does the DISTINCT in the end, and also DISTINCT takes up less memory space for running of query, while GROUP BY is made to run the query speedily but it takes more memory.
NOTE :
NESTED GROUP BY STATEMENT, has very bad performance when compared to NESTED DISTINCT.
Now when the records are less, then we can use DISTINCT, whereas when the no of records are more then we can use GROUP BY.
0 comments:
Post a Comment