10/19/13

Primary Index Choice Criteria

We have already seen the Tutorial on Primary Index, and we understood how Primary Indexes work and how they help us maximize performance.

Today we will be learning how to choose the Primary Index (or Indexes) for a given table. But before we move ahead and define the criteria for choosing a Primary Index, here is a tip:

TIP

If you don't define any index on a table, then Teradata decides on its own. It makes the decision based on the points below (a quick illustration follows the list):

1. If you have a column with a Primary Key constraint in the table definition, Teradata will make that column the Unique Primary Index.

2. If you have one or many columns with a Unique constraint, then Teradata will choose only the first such column in the table definition as the Unique Primary Index and the others as Unique Secondary Indexes.

3. If you don't have any column defined with either a Primary Key or a Unique constraint, then Teradata makes the first column in the table definition the Non-Unique Primary Index.
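
To see which default Teradata actually picked, you can create a table without a PRIMARY INDEX clause and then inspect it. Here is a minimal sketch (the table and column names are made up, and the behaviour assumes a system with default settings):

-- No PRIMARY INDEX clause here; because EMP_ID carries a PRIMARY KEY
-- constraint, Teradata is expected to implement it as the Unique
-- Primary Index (point 1 above).
CREATE TABLE EMP_DEFAULT_PI
(EMP_ID   INTEGER NOT NULL PRIMARY KEY
, EMP_NAME CHAR(15));

-- Verify which index Teradata actually chose.
SHOW TABLE EMP_DEFAULT_PI;
HELP INDEX EMP_DEFAULT_PI;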

Primary Index Choice Criteria

Now let's start with the criteria for choosing a good Primary Index. Basically, there are three Primary Index choice criteria: Access Demographics, Distribution Demographics, and Volatility.

Access Demographics

By Access Demographics we mean the columns used to access the table, i.e. the columns used in the WHERE clause of the SQL statement. So choose the column(s) most frequently used for access, to maximize the number of one-AMP operations. We need to consider both value access and join access.
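
For example, if EMP_ID is the PI of a hypothetical EMPLOYEE table, both kinds of access benefit (the SALARY table and all names here are illustrative):

-- Value access: equality on the PI column, which can be satisfied
-- as a one-AMP operation.
SELECT *
FROM   EMPLOYEE
WHERE  EMP_ID = 1001;

-- Join access: the PI column is used as the join column, so matching
-- rows hash to the same AMP.
SELECT E.EMP_NAME, S.SALARY_AMT
FROM   EMPLOYEE E
JOIN   SALARY S
ON     E.EMP_ID = S.EMP_ID;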

Distribution Demographics

As we know from our previous post on Primary Indexes, the more unique the index, the better the distribution. Optimizing distribution optimizes parallel processing.

Volatility

You can guess what this means if you know the meaning of the word volatile. The point is that we have to choose a column with a low rate of change. The Primary Index should not be volatile at all, because any change to a PI value results in heavy I/O overhead: when the PI value changes, the row has to be rehashed and moved from one AMP to another. Therefore, we have to choose a column with stable values.

**Note**

There is a trade-off between the access and distribution demographics. The most desirable situation is to find a column that has both good access and good distribution demographics.


5/27/13

Tutorial on Primary Index in Teradata

The last tutorial was on Indexes in Teradata, in which we covered the Primary Index, Secondary Index and Join Index. What we saw there was just a summary of all the indexes.

This tutorial will help you understand the Primary Index in a much clearer way. As we saw last time, a Primary Index determines the distribution of the rows on the disks controlled by the AMPs. A primary index in Teradata is required for row distribution and storage.

The image below shows how the distribution takes place using the Primary Index:



The PI for a table should represent the data values used by SQL to access the data for the table. Therefore, a PI for a table should be selected very carefully as it can improve or degrade the performance of SQL used for accessing that table. 

Rules for Defining a PRIMARY INDEX



Below are some rules I have come across for identifying which column should be defined as the PI:

* The PI selected should be as unique as possible: the more unique the index, the more evenly the rows will be distributed across the AMPs and the better the space utilization.
* Define the index on as few columns as you can. 
* The PI defined can have unique or non-unique values. If defined as unique, you will have to make sure that each value you insert into the column is unique and never repeated, whereas if defined as non-unique you can insert duplicate values into that column.
* Flags and junk dimensions are never to be defined as the PI, as they lead to more skewness. The data is then skewed, and when you perform a join it can hamper performance and lead to spool space errors (a rough way to check a candidate column's distribution is sketched after this list).
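
A rough way to check a candidate column's distribution before making it the PI is to group by the AMP its values would hash to (MY_TABLE and CANDIDATE_COL are placeholder names):

-- Rows per AMP if CANDIDATE_COL were the PI; one AMP holding far more
-- rows than the rest points to a skewed, and therefore poor, choice.
SELECT HASHAMP(HASHBUCKET(HASHROW(CANDIDATE_COL))) AS amp_no
     , COUNT(*) AS row_count
FROM   MY_TABLE
GROUP BY 1
ORDER BY 2 DESC;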

NOTE: 

If you forget to define a PRIMARY INDEX in the CREATE TABLE, the default will be used, i.e. based on the following (in this order):

* PRIMARY key
* First UNIQUE constraint
* First column

Creating PRIMARY INDEX


A unique primary index is created using the UNIQUE PRIMARY INDEX clause in the CREATE TABLE statement. A non-unique primary index is created in the same way, just without the UNIQUE keyword.

NOTE: Once a PRIMARY INDEX is created on a TABLE it cannot be dropped or modified; the index must be changed by recreating the TABLE.

Example: Creating a UNIQUE PRIMARY INDEX


CREATE TABLE EMPLOYEE
(EMP_ID INTEGER
, EMP_NAME CHAR(15)
, EMP_GRADE CHAR(1))
 UNIQUE PRIMARY INDEX (EMP_ID);

Example: Creating a NON-UNIQUE PRIMARY INDEX


CREATE TABLE STUDENT
(STUDENT_ID CHAR(15)
, STUDENT_NAME CHAR(20)
, DEPARTMENT CHAR(10)) 
PRIMARY INDEX (STUDENT_ID);
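
Tying this back to the NOTE above: if the PI of the EMPLOYEE table later had to change, the table would have to be rebuilt along these lines (EMPLOYEE_NEW is just an illustrative staging name):

-- 1. Create a new table with the desired PRIMARY INDEX.
CREATE TABLE EMPLOYEE_NEW
(EMP_ID INTEGER
, EMP_NAME CHAR(15)
, EMP_GRADE CHAR(1))
PRIMARY INDEX (EMP_NAME);

-- 2. Copy the data across.
INSERT INTO EMPLOYEE_NEW SELECT * FROM EMPLOYEE;

-- 3. Drop the old table and put the new one in its place.
DROP TABLE EMPLOYEE;
RENAME TABLE EMPLOYEE_NEW TO EMPLOYEE;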


Data Access using PI




When a query contains a WHERE clause on the primary index value(s), the request is processed by hashing the values to find the AMP where the row is stored, and the row whose RowID contains the same hash value is then returned.
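
Using the EMPLOYEE table created above, you can see this single-AMP behaviour in the query plan; since EMP_ID is the UNIQUE PRIMARY INDEX, the plan should describe a single-AMP retrieve by way of the unique primary index (1001 is just a sample value):

EXPLAIN
SELECT *
FROM   EMPLOYEE
WHERE  EMP_ID = 1001;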


Primary Key vs Primary Index



I have seen many people working on Teradata who still confuse the Primary Key with the Primary Index. So here is a small comparison between the two, as they differ conceptually:


| Term | Primary Key | Primary Index |
| --- | --- | --- |
| Requirement | Not required, unless referential integrity checks are to be performed | Required |
| Defining | Defined by the CREATE TABLE statement | Defined by the CREATE TABLE statement |
| Uniqueness | Unique | Unique or non-unique |
| Function | Identifies a row uniquely | Distributes rows |
| Values can be changed? | No | Yes |
| Can be null? | No | Yes |
| Related to access path? | No | Yes |
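
To make the distinction concrete, the two can live side by side in one table definition; here is a hypothetical sketch where the logical key and the distribution column differ:

-- ORDER_ID is the logical Primary Key (uniquely identifies a row),
-- while CUSTOMER_ID is the Primary Index (decides which AMP stores
-- the row). All names here are illustrative.
CREATE TABLE ORDERS
(ORDER_ID    INTEGER NOT NULL PRIMARY KEY
, CUSTOMER_ID INTEGER NOT NULL
, ORDER_AMT   DECIMAL(12,2))
PRIMARY INDEX (CUSTOMER_ID);

Here the PRIMARY KEY is enforced as a unique index behind the scenes, while row distribution is driven purely by CUSTOMER_ID.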

Hope this tutorial and the above short comparison help everyone understand what exactly a PRIMARY INDEX is and how it should be chosen and defined. Do let me know your views about the post and help me improve it.


4/20/13

Indexes in Teradata

Teradata, as we all know, is a Relational Database Management System (RDBMS) used for some of the world's largest commercial databases. It is the market leader in Data Warehousing. Its architecture is designed to take advantage of parallelism.

Almost every database has what we know as a KEY, i.e. Primary Key, Foreign Key, etc., defined while implementing the DDL of a table. But in Teradata the physical design does not revolve around keys; instead what we have is the INDEX. Indexes are conceptually different from keys: PRIMARY KEY is a relational modelling term defined in the logical model, whereas PRIMARY INDEX is a physical database implementation term that defines the actual columns used to distribute and access rows in a table.

What is an Index?


In the Teradata RDBMS, an index is used to define row uniqueness and retrieve data rows; it can also be used to enforce the primary key and unique constraints for a table. The Teradata RDBMS supports five types of indexes:

* Unique Primary Index (UPI)
* Unique Secondary Index (USI)
* Non-Unique Primary Index (NUPI)
* Non-Unique Secondary Index (NUSI)
* Join Index

A typical index contains two fields:

* A Value
* A Pointer to instances of the value in a data table

Teradata uses the index value together with hashing to distribute rows across the AMPs: the value is condensed into an entity called a row hash.
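
As a rough illustration, Teradata exposes these hashing steps as SQL functions, so for an arbitrary index value such as 1001 you can look at the row hash, the hash bucket and the AMP it maps to:

SELECT HASHROW(1001)                      AS row_hash    -- 4-byte row hash of the value
     , HASHBUCKET(HASHROW(1001))          AS hash_bucket -- bucket the hash falls into
     , HASHAMP(HASHBUCKET(HASHROW(1001))) AS amp_number; -- AMP that owns that bucket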

The following rules apply to the indexes:

* As mentioned above, an index helps in the distribution and retrieval of a table's rows. It can be made up of one or more columns.
* A table can have a number of indexes, including 1 Primary Index and up to 32 Secondary Indexes.
* An index defined for a table can be primary or secondary, and can be unique or non-unique. Each type of index affects the performance of the system and also the data integrity.
* An index is generally defined on columns of a table that are used in WHERE clauses or join conditions.
* Indexes help in enforcing PRIMARY KEY and UNIQUE constraints.


Primary Index



The Primary Index determines the distribution of the rows on the disks controlled by the AMPs. A primary index in the Teradata RDBMS is required for row distribution and storage. When a row is inserted, its hash code is calculated using a hashing algorithm and, depending on whether the index is UNIQUE or non-UNIQUE, the value is checked for duplicates. Rows having the same hash value are stored on the same AMP.




Secondary Index



In addition to a primary index, up to 32 unique and non-unique secondary indexes can be defined for a table. Compared to the primary index, secondary indexes provide an alternate, less frequently used path to access the rows in the table. Basically, a Secondary Index is a subtable that is stored on all AMPs, separately from the primary table. The subtable, built and maintained by the system, contains the following information:

* RowIDs of the subtable rows
* Base table index column values
* RowIDs of the base table rows

The diagram below shows how the secondary index subtable on each AMP is associated with the base table by the RowIDs:
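
For reference, secondary indexes are defined with CREATE INDEX and removed with DROP INDEX; here is a minimal sketch against a hypothetical EMPLOYEE table:

-- Non-unique secondary index (NUSI) on EMP_NAME.
CREATE INDEX EMP_NAME_NUSI (EMP_NAME) ON EMPLOYEE;

-- Unique secondary index (USI) on a hypothetical EMP_SSN column.
CREATE UNIQUE INDEX (EMP_SSN) ON EMPLOYEE;

-- Dropping a secondary index.
DROP INDEX EMP_NAME_NUSI ON EMPLOYEE;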


Join Index



A join index is nothing but an indexing structure that contains columns from multiple tables. Rather than having to join the individual tables each time the join operation is needed, the query can be resolved via the join index, which in most cases dramatically improves performance.

SYNTAX:

CREATE JOIN INDEX [<database-name>.]<index-name>
[[NO] FALLBACK]
AS SELECT
[<columns>]
[SUM <numeric-expression>]
[COUNT <column-expression>]
[EXTRACT YEAR|MONTH FROM <date-expression>]
FROM [<database-name>.]<table-name(s)>
[WHERE <search-condition>]
[GROUP BY <column-name>]
[ORDER BY <column-name>]
PRIMARY INDEX (<column-name>)
[INDEX <column-name> ORDER BY HASH|VALUES]
;
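
As a concrete (and purely illustrative) instance of the syntax above, the join index below pre-joins hypothetical EMPLOYEE and DEPARTMENT tables so that queries joining them on DEPT_ID can be answered from the index instead:

CREATE JOIN INDEX EMP_DEPT_JI
AS SELECT E.EMP_ID
        , E.EMP_NAME
        , D.DEPT_NAME
   FROM  EMPLOYEE E
   JOIN  DEPARTMENT D
   ON    E.DEPT_ID = D.DEPT_ID
PRIMARY INDEX (EMP_ID);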


2/10/13

QuickSort Algorithm Tutorial

We have already done a tutorial on Merge Sort and a tutorial on Heap Sort (Array Based), both having a time complexity of O(n log n). Here is another algorithm with an average time complexity of O(n log n): it's called QuickSort.


QuickSort, as we all know, takes a similar approach to Merge Sort, i.e. it uses a recursive divide-and-conquer strategy to sort the values. The difference is that it is an in-place sorting algorithm.


Basically an in-place algorithm is one which transforms the input using a data structure with a small, constant amount of extra storage space.

Binary Search Tree (BST) Algorithm Tutorial

Earlier we had a tutorial on Binary Search Tree Basics, which you can check to refresh your knowledge of it. Today we will be taking a look at the BST algorithm and implementing it using Java.

A Binary Search Tree is a node-based binary tree data structure with the following properties:

* The Left subtree contains the nodes with keys less than the node's key.
* The Right subtree contains the nodes with keys greater than the node's key.
* Both the right and left subtrees should also be binary search trees.
* There should not be any duplicate nodes.

We have implemented the following operations of a Binary Search Tree:

* Searching
* Insert Node
* MinValue
* MaxValue

We will be seeing each of these operations and the corresponding Java code.


1/6/13

Data Warehouse Design Approaches

In our previous posts we learned about Data Warehousing Objects, different kinds of Data Warehouse schemas and Data Warehouse basics. Now it's time we learn how to build or design a Data Warehouse.

Designing or building a Data Warehouse can be done following either one of two approaches. These approaches are notably known as:

* The Top-Down Approach
* The Bottom-Up Approach

These approaches were defined by two of the pioneers of Data Warehousing, namely Ralph Kimball and Bill Inmon.