DMW Knowledge Base
What is a data warehouse?
A data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management's decision-making process.
What is data mining?
Data mining is the process of discovering hidden, predictive patterns in large collections of data.
What is a surrogate key?
In order to isolate the data warehouse from source systems, we introduce a technical surrogate key instead of re-using the source system's natural (business) key. If the surrogate key generation process is implemented correctly, adding a new source system to the data warehouse processing will not require major effort.
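A minimal Python sketch of such a generation process, assuming an in-memory key map (a real warehouse would use a database sequence or identity column and a persistent key-map table):

```python
from itertools import count

_next_key = count(1)   # surrogate key sequence
_key_map = {}          # (source_system, natural_key) -> surrogate key

def surrogate_key(source_system: str, natural_key: str) -> int:
    """Return the existing surrogate key, or allocate a new one."""
    pair = (source_system, natural_key)
    if pair not in _key_map:
        _key_map[pair] = next(_next_key)
    return _key_map[pair]

# A new source system reuses the same process without touching the warehouse:
print(surrogate_key("CRM", "CUST-042"))   # 1
print(surrogate_key("ERP", "CUST-042"))   # 2 (different system, new key)
print(surrogate_key("CRM", "CUST-042"))   # 1 (stable on re-load)
```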
What are Slowly Changing Dimensions?
Dimensions that change over time are called Slowly Changing Dimensions. For instance, a product's price changes over time, people change their names, and country and state names may change. These are all Slowly Changing Dimensions because their attributes change gradually over a period of time. There are three standard ways of handling such changes, sketched below:
Type 1 : Overwrite the old values (no history is kept).
Type 2 : Create an additional record (a new row preserving history).
Type 3 : Create new fields to hold the previous values.
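A minimal Python sketch of Type 1 and Type 2 handling for a product-price change; the table and column names here are illustrative (Type 3 would instead add a previous_price column to the same row):

```python
from datetime import date

dim_product = [
    {"product_key": 1, "product_id": "P100", "price": 9.99,
     "valid_from": date(2020, 1, 1), "valid_to": None, "current": True},
]

def scd_type1(product_id, new_price):
    """Type 1: overwrite the old value; history is lost."""
    for row in dim_product:
        if row["product_id"] == product_id and row["current"]:
            row["price"] = new_price

def scd_type2(product_id, new_price, change_date):
    """Type 2: expire the current row and add a new versioned row."""
    for row in dim_product:
        if row["product_id"] == product_id and row["current"]:
            row["valid_to"] = change_date
            row["current"] = False
            dim_product.append({**row, "product_key": len(dim_product) + 1,
                                "price": new_price, "valid_from": change_date,
                                "valid_to": None, "current": True})
            break

scd_type2("P100", 12.49, date(2021, 6, 1))
print(dim_product)   # two rows: the expired version and the current one
```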
What is Star Schema?
A star schema is the simplest dimensional model: a central fact table holds the measures and foreign keys, and each denormalized dimension table joins directly to it, so the diagram resembles a star.
What are Snowflake schemas?
Snowflake schemas normalize dimensions to eliminate redundancy. That is, the dimension data has been grouped into multiple tables instead of one large table. For example, a location dimension table in a star schema might be normalized into a location table and city table in a snowflake schema. While this saves space, it increases the number of dimension tables and requires more foreign key joins. The result is more complex queries and reduced query performance.
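A small pandas sketch of the location example above (column names are assumptions): the star version keeps city attributes redundantly in one dimension table, and the snowflake version splits them out, at the cost of an extra join:

```python
import pandas as pd

# Star schema: one denormalized location dimension (city data repeated).
dim_location_star = pd.DataFrame({
    "location_key": [1, 2, 3],
    "street":       ["12 Oak St", "9 Elm Ave", "4 Pine Rd"],
    "city":         ["Austin", "Austin", "Dallas"],
    "state":        ["TX", "TX", "TX"],
})

# Snowflake schema: normalize city/state into their own table.
dim_city = (dim_location_star[["city", "state"]]
            .drop_duplicates()
            .reset_index(drop=True))
dim_city["city_key"] = dim_city.index + 1

dim_location = dim_location_star.merge(dim_city, on=["city", "state"])
dim_location = dim_location[["location_key", "street", "city_key"]]

# Querying the snowflake form now needs an extra join to recover city names:
rejoined = dim_location.merge(dim_city, on="city_key")
print(rejoined)
```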
Supervised vs. Unsupervised learning
In supervised learning, the classes are predetermined: the data presented to the machine learning algorithm is fully labelled (e.g., classification with neural networks).
In unsupervised learning, the algorithm decides how to group samples into classes that share common properties (e.g., clustering).
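A minimal scikit-learn sketch of both settings on synthetic data (assuming scikit-learn is installed):

```python
from sklearn.datasets import make_blobs
from sklearn.neural_network import MLPClassifier
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# Supervised: the labels y are given to the learner.
clf = MLPClassifier(max_iter=1000, random_state=0).fit(X, y)
print(clf.predict(X[:5]))

# Unsupervised: no labels; the algorithm groups the samples itself.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])
```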
What is KPI?
KPI stands for Key Performance Indicator.
In Analysis Services, a KPI is a collection of calculations, associated with a measure group in a cube, that is used to evaluate business success.
KPIs are created from existing measures. However, not all measures are good candidates for KPI utilization.
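As an illustration only (this is not Analysis Services KPI syntax), the value/goal/status idea behind a KPI can be sketched in Python; the measure names and thresholds here are assumptions:

```python
def kpi_status(value: float, goal: float) -> int:
    """Return 1 (good), 0 (warning), or -1 (bad), mimicking a KPI status band."""
    ratio = value / goal
    if ratio >= 0.95:
        return 1
    if ratio >= 0.80:
        return 0
    return -1

sales_amount = 182_000.0   # value expression: an existing measure (assumed)
sales_goal   = 200_000.0   # goal expression (assumed)
print(kpi_status(sales_amount, sales_goal))   # 0 -> warning
```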
What is pivot table?
A pivot table makes it easy to extract information from a large table of data without the use of formulas or functions. The extracted data can be sorted and filtered, and it can be viewed along different dimensions.
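For example, with pandas (the table and column names are illustrative):

```python
import pandas as pd

sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West"],
    "product": ["A", "B", "A", "B"],
    "amount":  [100, 150, 200, 50],
})

# Rows = region, columns = product, cell = total amount: no formulas needed.
pivot = sales.pivot_table(index="region", columns="product",
                          values="amount", aggfunc="sum")
print(pivot)
```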
What is ETL?
In a data warehouse, data from different source systems is integrated, processed and stored. The ETL (Extraction, Transformation and Loading) process handles this data and transforms it into a more consistent, standard format. ETL is done with the help of an ETL tool; the most widely used tools are Informatica, DataStage, Ab Initio, Oracle Warehouse Builder, etc.
The first step in the ETL process is mapping the data between the source systems and the target database (data warehouse or data mart). The second step is cleansing the source data in the staging area. The third step is transforming the cleansed source data and then loading it into the target system.
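A schematic pandas sketch of these steps; the file paths and column names are assumptions standing in for a real ETL tool, and the mapping step is design-time work reflected here only in the chosen column names:

```python
import pandas as pd

def extract(path: str) -> pd.DataFrame:
    """Pull raw rows from a source system (here, a CSV file)."""
    return pd.read_csv(path)

def cleanse(df: pd.DataFrame) -> pd.DataFrame:
    """Step 2: cleanse in the staging area: drop duplicates and bad keys."""
    df = df.drop_duplicates()
    return df.dropna(subset=["customer_id"])

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Step 3a: standardize formats according to the source-to-target mapping."""
    df["order_date"] = pd.to_datetime(df["order_date"])
    df["country"] = df["country"].str.upper()
    return df

def load(df: pd.DataFrame, target: str) -> None:
    """Step 3b: load into the target warehouse table (here, a file)."""
    df.to_parquet(target)

# load(transform(cleanse(extract("orders.csv"))), "dw_orders.parquet")
```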
ETL Tasks / Data Transformations
Filtering - Tests the value of a field in each input row against a condition; if the condition is false, the row is ignored, otherwise it passes into the result dataset.
Data lookup - Performs lookups by joining data in input columns with columns in a reference dataset.
Derived columns - Values for a new column are computed from existing columns.
Data cleansing - The process of resolving inconsistencies and fixing the anomalies in source data, typically as part of the ETL process.
Data validation - Checks data types and formats (e.g., email, URL, zip code).
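The transformations above, in one minimal pandas sketch (table and column names are assumptions):

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "qty":      [5, 0, 2],
    "price":    [10.0, 8.0, 4.5],
    "cust_id":  ["C1", "C2", "C9"],
    "email":    ["a@x.com", "bad-email", "c@y.com"],
})
customers = pd.DataFrame({"cust_id": ["C1", "C2"],
                          "name": ["Ada", "Boole"]})

# Filtering: keep only rows that satisfy the condition.
orders = orders[orders["qty"] > 0]

# Data lookup: join input columns against a reference dataset.
orders = orders.merge(customers, on="cust_id", how="left")

# Derived column: compute a new column from existing columns.
orders["total"] = orders["qty"] * orders["price"]

# Data validation: check a format (email) with a simple pattern.
orders["email_ok"] = orders["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Data cleansing: resolve an inconsistency found during validation.
orders.loc[~orders["email_ok"], "email"] = None
print(orders)
```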
What is Meta Data?
Data that describes data and other structures, such as objects, business rules, and processes.
E.g., the schema design of a data warehouse.
A repository contains metadata.
Difference between Relational modeling and Multidimensional modeling
1. Relational models are complex due to the joins with other tables. Multidimensional models are very simple: each dimension table has a direct relationship with the fact table.
2. Relational models are mostly normalized. Multidimensional models are radically denormalized.
3. Standard relational models are optimized for On-Line Transaction Processing (OLTP). OLTP needs the ability to efficiently update data, which a normalized database provides by storing each value only once.
Multidimensional models are optimized for On-Line Analytical Processing (OLAP). OLAP needs the ability to retrieve data efficiently, which requires a minimum number of joins. The simple structure of a multidimensional model provides this, since each dimension table is only a single join away from the fact table.
4. SQL queries are required to manipulate relational data; MDX queries are required to manipulate multidimensional data.
5. Tables are the units of relational data storage. Cubes are the units of multidimensional data storage.
What is a Gini index?
The Gini index measures how impure a node is, i.e., how mixed its class labels are. For class proportions p_i, Gini = 1 − Σ p_i²; a pure node has Gini 0. Decision-tree algorithms such as CART choose the split that minimizes the weighted Gini index of the resulting child nodes.
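A minimal sketch of the computation:

```python
from collections import Counter

def gini(labels) -> float:
    """Gini impurity: 1 - sum(p_i^2) over the class proportions p_i."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_gini(left, right) -> float:
    """Weighted Gini of a candidate split: lower means a better split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

print(gini(["yes", "yes", "yes"]))        # 0.0 -> pure node
print(gini(["yes", "no", "yes", "no"]))   # 0.5 -> maximally impure (2 classes)
print(split_gini(["yes", "yes"], ["no", "no"]))   # 0.0 -> perfect split
```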