
Data Warehouse and Business Intelligence Project Report
Class Assessment – 2

Data Warehouse for Nasdaq Listed Company


Submitted By:
Pramit Kumar
Student Number: X01419895
Course: MSc in Data Analytics

Introduction:
Data is growing at a rapid pace, and with digitalisation and social media, online data has become an increasingly important input to the data warehouse. Within the same business subject area, unstructured data is now combined with structured data to support quick and well-informed decisions. As the business grows rapidly, a centralised place or system is needed where data from heterogeneous sources can be stored and managed.
The proposed centralised data warehouse will store the profiles of all Nasdaq-listed companies worldwide by region, their last sales and stock behaviour, and their review ratings in terms of Work-Life Balance, Compensation and Benefits, Job Security/Advancement, Management and Culture, all in one place, helping end users make decisions. It will be a central repository into which data is integrated from different heterogeneous sources, storing current and historical data for analytics and reporting purposes. This new data warehouse will help end users understand a Nasdaq-listed company's profile, the country it belongs to, the headcount in any specific region, current share and market values at any point in time, and more. Before making a move or an investment in any company, an individual should have complete information about the listed company at that point in time.
Data Sources:
To build the proposed data warehouse, we need data that includes information about companies, stocks, headcount and company ratings. This information can be collected from various sources such as company portals and publicly available datasets provided by websites.
Data falls into three main categories: structured, semi-structured and unstructured.

Architecture:
The centralised repository is built to store information about companies listed on Nasdaq, including market value, last sale, shares and stocks, headcount and their overall worldwide rating/grades. While analysing the source data, it was found that the measures (numeric values) come in different forms: additive (total market value, total headcount, etc.), semi-additive (share and stock values) and non-additive (rates and percentages). After closely considering the scope of the source data, the volumetric analysis, the infrastructure and the timeline, the warehouse architecture should be a hybrid model combining the Inmon and Kimball methodologies, with a three-tier architecture consisting of Bottom (Tier 1), Middle (Tier 2) and Top (Tier 3) tiers.

(Chaudhuri, 1997)
Source: https://gauravag1112.wordpress.com/tag/integration-layer/

Bottom Tier – This is the central data warehouse repository in a relational database system. Data arrives from different heterogeneous sources after being extracted, cleaned, transformed and held in intermediate staging tables, before finally being loaded into the data warehouse tables. It stores both current and historical data.
Middle Tier – This part of the data warehouse consists of OLAP servers built on MOLAP or ROLAP designs. It is an intermediate tier between the end user and the data warehouse and often stores pre-calculated measures (mostly aggregates) for the relevant time periods, so that top-tier users get faster query performance.
Top Tier – This is the business intelligence layer, where different tools and APIs connect to the warehouse repository to retrieve the data required for business or decision-making reports and dashboards.
Source: https://www.guru99.com/data-warehouse-architecture.html#3

Data Warehouse Data Model:
As per the proposed architecture, the data model includes table structures for the staging layer, the data warehouse layer and the MOLAP or ROLAP model.
1. Staging Area: The tables designed in this layer are a direct replica of the source data columns, meaning the attributes present in the source files exactly match the attributes of the staging tables, and the data type of every STG_* column (Varchar(256)) corresponds to the String type of the source files. The source-to-staging mapping is shown below, and a sketch of one such staging table follows the modelling note.

Source file name | Data type | Target table name | Data type | SCD type
Nasdaq_Company.csv | String | STG_Nasdaq_Company | Varchar(256) | SCD 1
Company_Share.csv | String | STG_Company_Shares | Varchar(256) | SCD 1
Company_Review.csv | String | STG_Company_Review | Varchar(256) | SCD 1
Region.csv | String | STG_Region | Varchar(256) | SCD 1
Company_Headcount.csv | String | STG_Headcount | Varchar(256) | SCD 1

Note: Data Modelling tool SAP Power Designer 16.6 is used to create all data models.
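For illustration, a minimal sketch of one staging table is given below. The column names follow the fields referenced later in the ETL section, and every column is deliberately kept as Varchar(256) to mirror the string-typed source exactly.

-- Sketch of a staging table: a 1:1, string-typed replica of the source file
CREATE TABLE dbo.STG_Nasdaq_Company (
    Symbol     VARCHAR(256) NULL,   -- ticker symbol, kept as text
    Company    VARCHAR(256) NULL,   -- company name
    LastSale   VARCHAR(256) NULL,   -- last sale price, not yet converted to decimal
    MarketCap  VARCHAR(256) NULL,   -- market capitalisation, not yet converted
    Country    VARCHAR(256) NULL    -- country of the listed company
);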

2. Data Warehouse Layer:

• It is a central data warehouse repository which stores current and historical data covering 10 years or more, and it follows the Inmon methodology of third normal form (3NF).

• In this layer the data warehouse tables are in 3NF and satisfy Codd's rules for a relational database system. The main reason for this design is to keep all three types of measures in one place: additive measures in the DW_Nasdaq_Company table, semi-additive measures in DW_Company_Share and non-additive measures in DW_Company_Review.

• Two new columns are added to the DW_* tables (see the sketch after the modelling note below):
o Surrogate key column (generally auto-increment)
o Timestamp column (generally the system timestamp when the data is loaded into the table)

• Since it stores all current and historical data, it will be an excellent source for downstream systems such as data marts, MOLAP or ROLAP cubes, statistical analysis, data mining, etc.

Note: Data Modelling tool SAP Power Designer 16.6 is used to create all data models.
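A minimal sketch of one DW_* table is shown below. The business columns follow the load statement used later in the ETL section, while the names of the surrogate key and timestamp columns (Company_SK, Load_Ts) are assumptions.

-- Sketch of a 3NF warehouse table with surrogate key and load timestamp
CREATE TABLE dbo.DW_Nasdaq_Company (
    Company_SK  INT IDENTITY(1,1) PRIMARY KEY,           -- surrogate key, auto-increment
    Symbol      VARCHAR(50)   NOT NULL,
    Company     VARCHAR(256)  NOT NULL,
    LastSale    DECIMAL(10,2) NULL,                       -- converted from Varchar in staging
    MarketCap   DECIMAL(25,2) NULL,
    Country     VARCHAR(100)  NULL,
    Load_Ts     DATETIME2 NOT NULL DEFAULT SYSDATETIME()  -- system timestamp at load time
);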

3. MOLAP or ROLAP Model: (designing a multidimensional cube)
MOLAP (multi-dimensional online analytical processing) is a form of OLAP in which a multi-dimensional array is used to store the data, so it differs considerably from a relational database system. It typically follows the Kimball methodology and organises tables into two main forms, dimensions and facts: a dimension table stores the characteristics of an entity, while a fact table stores the measures associated with the different entities participating in the MOLAP cube.
Deciding factors for dimensions and facts:
• In the current use case, the measures stored in the data warehouse layer are additive, semi-additive and non-additive in nature, so the semi-additive (share and stock values) and non-additive (review ratings and percentages) measures must be handled carefully at the MOLAP level.
• After data analysis, it was decided that:
o Additive measures are stored in a separate fact table, Fact_NasdaqComp.
o Semi-additive measures are stored in a separate fact table, Fact_Company_Share.
o Non-additive measures are converted to grades (A, A+, A++, Best, Good, Fair), which are non-numeric, and can then be stored in a behaviour dimension (Dim_Company_Review).
o The date and company dimensions act as conformed dimensions, since they can be reused with multiple fact tables (e.g. Dim_Date and Dim_Company).
o Dim_Region and Dim_Headcount are regular dimension tables.

• Level of granularity: the lowest level of detail represented by a single row in the cube's fact tables.

Fact_NasdaqComp -> Company/Country/Headcount/Review/Sale

Fact_Company_Share -> Date/Company/Share value/Volume

• The most important feature of MOLAP tools is that they can store pre-computed values (for example aggregations of total headcount, total last sale and total market value); building these when the MOLAP cube is run is the operation known as cube processing. A sketch of the star-schema fact table at this grain is given below.
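To make the grain concrete, a minimal sketch of the semi-additive fact table at the Date/Company grain is given below; the exact column and key names are assumptions based on the measures listed for this fact.

-- Sketch of the share fact table at Date/Company grain (star schema)
CREATE TABLE dbo.Fact_Company_Share (
    Date_SK     INT NOT NULL,        -- foreign key to Dim_Date (conformed dimension)
    Company_SK  INT NOT NULL,        -- foreign key to Dim_Company (conformed dimension)
    Share_High  DECIMAL(10,2) NULL,  -- semi-additive: can be summed across companies, not across dates
    Share_Low   DECIMAL(10,2) NULL,
    Last_Closed DECIMAL(10,2) NULL,
    Volume      BIGINT NULL
);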
Source : https://en.wikipedia.org/wiki/Online_analytical_processing#Multidimensional_OLAP_.28MOLAP.29

Source: Developed MOLAP Cube
Extraction Transformation Loading (ETL) strategy:
To set up a data warehouse, ETL plays a very significant role. Before data gets into the warehouse, the nature of the data, the data model and the reports in scope should be read and understood by the architects or data modellers.
1. Method used to get data from the various sources to the system file location:
a. Company information (structured data) -> dataset downloaded directly from the portal as .CSV to a local file location and loaded into the staging table using SSIS.
b. Company historical stock (structured data) -> dataset downloaded directly from the portal as .CSV to a local file location and loaded into the staging table using SSIS.
c. Company ranking and headcount (semi-structured data) -> information scraped from the website into .CSV at a local file location and loaded into the staging table using SSIS.
d. Company reviews (unstructured data) -> a Python script has been used to extract the information into .CSV format at a local file location, which is then loaded into the staging table using SSIS.

To extract the unstructured data, the following Python script has been used.
# Scrape company review ratings from Indeed and append them to a CSV file (Python 2 / urllib2)
import urllib2
import csv
from bs4 import BeautifulSoup

f = open('companies.txt')
lines = f.read().splitlines()

for company in lines:
    with open('company_ratings', 'a') as csv_file:
        writer = csv.writer(csv_file)
        # Build the company page URL and fetch the HTML
        _page = 'https://www.indeed.com/cmp/' + company + '?from=cmp-search-autocomplete'
        page = urllib2.urlopen(_page)
        soup = BeautifulSoup(page, 'html.parser')
        # Each review category block holds the rating value and the category name
        reviews = soup.find_all('div', attrs={'class': 'cmp-ReviewCategory-category'})
        for review in reviews:
            _rating = review.contents[0]
            _category = review.contents[2]
            writer.writerow([company, _rating.get_text(), _category.get_text()])

f.close()

2. High-level diagram of source-destination flow (column mapping)

3. Pre-Requisite Tools
Pre-Requisite | Tool
Database | SQL Server 17.6
Data Integration Tool | SQL Server Integration Services
MOLAP Server | SQL Server Analysis Services

4. Develop default strategies for common activities, e.g. extraction, dimension management, etc
4.1 Staging Layer
Data extraction to the staging layer/landing zone (current/daily data -> SCD 1)
This is a transient work area in which all heterogeneous source data is converted into a homogeneous, tabular form. In the current ETL process:
1. SSIS extracts the data from the source file system and loads it into SQL Server tables named staging tables.
2. The layer follows a slowly changing dimension type 1 approach (SCD 1 -> truncate and load), meaning that each time a new dataset is loaded the staging tables are first truncated and then reloaded; a sketch of this pattern is given at the end of this subsection.
3. In general, source data is dumped directly into the staging tables without many changes; it is a strict 1:1 mapping from source columns to staging table columns.
Components of the source:
Source type | File name | Data type
Comma separated file | Nasdaq_Company.csv | String
Comma separated file | Company_Share.csv | String
Comma separated file | Company_Review.csv | String
Comma separated file | Region.csv | String
Comma separated file | Company_Headcount.csv | String

4. Components of the target:
Target table name | Data type | SCD type
STG_Company_Review | Varchar(256) | SCD 1
STG_Company_Shares | Varchar(256) | SCD 1
STG_Nasdaq_Company | Varchar(256) | SCD 1
STG_Headcount | Varchar(256) | SCD 1
STG_Region | Varchar(256) | SCD 1

5. The most important part of this step is the conversion between Unicode and non-Unicode string types.

Source: Constructed SSIS Package.

6. Staging tables store only the current-date (daily) load.
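As referenced in point 2 above, a minimal T-SQL sketch of the truncate-and-load (SCD 1) pattern for one staging table is given below. In the actual package the reload is performed by an SSIS data flow (flat file source to OLE DB destination); a BULK INSERT with an assumed file path is shown here only to illustrate the same effect.

-- SCD 1 pattern for the staging layer: wipe the previous load, then reload today's file
TRUNCATE TABLE dbo.STG_Nasdaq_Company;

BULK INSERT dbo.STG_Nasdaq_Company
FROM 'C:\data\Nasdaq_Company.csv'    -- assumed local file location
WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');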

4.2 Data Warehouse layer
This is a centralised repository which stores current and historical data in one place and from which data can be distributed to downstream systems. In the current project, the Inmon approach is used to keep the data in third normal form (3NF) and avoid data redundancy.
1. All tables (DW_*) are in third normal form and follow Codd's rules for an RDBMS.
2. This layer follows a slowly changing dimension type 2 approach (SCD 2), which means data can only be appended to the DW_* tables, together with a load timestamp (a sketch of this append pattern follows the transformation list below).
3. Two new columns are added to the DW_* tables.
a. Surrogate key column (generally auto-increment)
b. Timestamp column (generally the system timestamp when the data is loaded into the table)

4. This layer stores current and historical data covering 10 years or more.
5. The main data transformations used:
a. A surrogate key column for each DW_* table, which should be auto-incrementing.
b. Data conversion from Varchar data types to Decimal data types, for example:
   INSERT INTO dbo.DW_Nasdaq_Company (Symbol, Company, LastSale, MarketCap, Country)
   SELECT Symbol, Company,
          CAST(LastSale  AS DECIMAL(10,2)) AS LastSale,
          CAST(MarketCap AS DECIMAL(25,2)) AS MarketCap,
          Country
   FROM dbo.STG_Nasdaq_Company

c. Fuzzy matching -> Since the company names and company codes appear with different abbreviations in the various source data, an SSIS fuzzy lookup has been used to match the names.

d. Send Mail task -> Once the data is loaded into all DW_* tables, an email notification with the load information is sent to the support team. (Note: currently not implemented due to an SMTP issue.)

e. Sequence container -> Depending on data nature and availability, a proper loading sequence has been introduced for the DW_* tables.
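As mentioned in point 2 above, a minimal sketch of the append-only (SCD 2) load with surrogate key and load timestamp is shown below. It reuses the cast example from point 5b; the Load_Ts column name is an assumption.

-- SCD 2 append: new rows are added with a load timestamp, existing rows are never overwritten
-- (the surrogate key is an IDENTITY column, so it is not listed in the insert)
INSERT INTO dbo.DW_Nasdaq_Company (Symbol, Company, LastSale, MarketCap, Country, Load_Ts)
SELECT Symbol,
       Company,
       CAST(LastSale  AS DECIMAL(10,2)),
       CAST(MarketCap AS DECIMAL(25,2)),
       Country,
       SYSDATETIME()                 -- system timestamp at load time
FROM dbo.STG_Nasdaq_Company;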

Components of the source:
Source type | Table name | Data type
Table | STG_Company_Review | Varchar(256)
Table | STG_Company_Shares | Varchar(256)
Table | STG_Nasdaq_Company | Varchar(256)
Table | STG_Headcount | Varchar(256)
Table | STG_Region | Varchar(256)

Components of the target:
Target table name | Data type | SCD type
DW_Company_Review | Decimal, Varchar | SCD 2
DW_Company_Shares | Int, Decimal, Varchar | SCD 2
DW_Nasdaq_Company | Int, Decimal, Varchar | SCD 2
DW_Headcount | Int, Varchar | SCD 2
DW_Region | Varchar | SCD 2

4.3 Multi-Dimensional Layer (MOLAP -> SSAS Cube)
OLAP is a broad category of analytical processing in which data is organised into dimensions and measures. It allows users to view data along multiple dimensions together with calculated measures. An OLAP cube supports three important analysis operations: consolidation (roll-up), drill-down, and slicing and dicing. In the current project, the Kimball approach is used to construct an SSAS cube.
1. The cube forms a star schema, with the fact tables at the centre and the dimensions surrounding them.
2. As per the business requirement, the fact tables store data for 5 years.
3. Fact_NasdaqComp stores additive measures such as LastSale, MarketValue and total headcount, while Fact_Company_Share stores semi-additive measures such as share high value, low value, last closed and volume.


4. Dimensions:
a. Dim_Date is created once; its data is loaded when the date dimension is built, and it acts as a conformed dimension.

b. Dim_Company_Review is a "behaviour dimension" which handles the non-additive measures. The transformation and loading of its data includes CASE statements (see the sketch after this list).

c. Dim_Company is a slowly changing dimension (insert and update) and acts as a conformed dimension.
d. Dim_Region and Dim_HeadCount are regular dimensions with few changes.
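A minimal sketch of the grade conversion referenced in point 4b is shown below; the rating column, the target columns and the grade thresholds are assumptions made purely for illustration, since the report does not state the exact banding.

-- Convert a numeric review rating into a non-numeric grade for Dim_Company_Review
INSERT INTO dbo.Dim_Company_Review (Company, WorkLifeBalance_Grade)
SELECT Company,
       CASE
           WHEN WorkLifeBalance >= 4.5 THEN 'A++'
           WHEN WorkLifeBalance >= 4.0 THEN 'A+'
           WHEN WorkLifeBalance >= 3.5 THEN 'A'
           WHEN WorkLifeBalance >= 3.0 THEN 'Best'
           WHEN WorkLifeBalance >= 2.0 THEN 'Good'
           ELSE 'Fair'
       END AS WorkLifeBalance_Grade
FROM dbo.DW_Company_Review;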

5. Data Loading Mechanism using SSIS Packages:
Data loading is one of the most important and challenging tasks when setting up a new data warehouse. While loading data into the staging area or warehouse tables, a few important aspects need to be taken care of. In the initial phase of the data warehouse, two types of load need to be handled:

• Historical Load
• Current Load

5.1 Historical Load.
This is a one-time load into the new data warehouse. After a volumetric analysis of the history data, I decided to use a separate SSIS package (DWBI_Nasdaq_Historical_Load.dtsx) which is mainly responsible for loading the historical data.

Source : DWBI_Nasdaq_Historical_Load.dtsx package developed in SSIS.
5.2 Current Load.
Once the initial (historical) load has completed successfully, the delta (incremental) load process starts. It can run at different frequencies depending on data availability; for a specific table, data may arrive daily, weekly, monthly, etc. In our case, data is loaded into the data warehouse daily using an SSIS package (DWBI_Nasdaq_Current_Load.dtsx).

Automation Implemented:
1. Daily load:
The daily load is a recurring process: every day, data from the different sources must be loaded into the related data warehouse tables. The various source files arrive at the designated location within a certain time window, after which the SSIS daily load package must be run to load the data into the warehouse tables. To avoid this manual intervention, the daily load can be automated, meaning that every day at a given time the daily load package (DWBI_Nasdaq_Current_Load.dtsx) runs and loads the data into the warehouse tables. This automation is achieved using SQL Server Agent.

For the daily load, a new job (DWBI_Nasdaq_Comp) is created which runs daily at the specified time, triggers the package (DWBI_Nasdaq_Current_Load.dtsx) and eventually loads the data into the warehouse tables.
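A minimal sketch of how such a job could be scripted is shown below, assuming the job is created in T-SQL rather than through the SQL Server Agent GUI; the package path, schedule name and start time are assumptions.

-- Create the daily-load job in msdb and schedule it to run every day
USE msdb;
EXEC dbo.sp_add_job         @job_name = N'DWBI_Nasdaq_Comp';
EXEC dbo.sp_add_jobstep     @job_name = N'DWBI_Nasdaq_Comp',
                            @step_name = N'Run current load package',
                            @subsystem = N'SSIS',
                            @command   = N'/FILE "C:\packages\DWBI_Nasdaq_Current_Load.dtsx"';
EXEC dbo.sp_add_schedule    @schedule_name = N'Daily_0600',
                            @freq_type = 4,              -- daily
                            @freq_interval = 1,
                            @active_start_time = 060000; -- 06:00:00
EXEC dbo.sp_attach_schedule @job_name = N'DWBI_Nasdaq_Comp', @schedule_name = N'Daily_0600';
EXEC dbo.sp_add_jobserver   @job_name = N'DWBI_Nasdaq_Comp'; -- register the job on the local server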

2. Cube Processing:
Each time a new set of data is loaded into the data warehouse, the MOLAP cube must be refreshed, which in SSAS terms is known as processing the cube. There must be a mechanism which processes the cube automatically. SSIS provides a special task called the "Analysis Services Processing Task": whenever this task is triggered, the specified SSAS cube is processed internally, and the task can be included at the end of the daily load flow (DWBI_Nasdaq_Current_Load.dtsx).

Source : DWBI_Nasdaq_Historical_Load.dtsx package developed in SSIS.

Business Intelligence Queries and Reports:

1. Pre-Requisite Tools
Pre-Requisite | Tool
Reporting Tool | Tableau and Microsoft Excel
MOLAP Server | SQL Server Analysis Services

1. Region Wise Nasdaq Listed Companies Rating and Headcounts.
The report below describes the Nasdaq listed companies belonging to the respective countries in the European region. It also shows each company's rating grade (A, A+, A++, etc.) and the total employee headcount per company for the European region.

2. Company Share Volume Report.
The report below describes the share volume a company had at any specific date over the last 5 years, as well as the highest share value on a given date. For any date within the past 5 years, a user can look back at the share volume of any listed company and understand the cyclic behaviour of its shares.

3. Nasdaq Company’s Rating Regression Model
The analysis below is a logistic regression between the dependent variable (Overall Rating) and the independent variables (Culture & Values, Work-Life Balance, Senior Management, Payment and Benefits). The p-values indicate whether each independent variable has a significant impact on the dependent variable.