如何构建事实表和多维表how to create fact and dimensions tables
2023年10月26日 · 736 字 · 2 分钟 · Database
本文转载翻译自{% referto [1], [How to create Fact and Dimension tables from denormalized raw data] %}。
如何基于非规范化的原始数据创建事实表和维度表
样本数据预览
Name | Gender | Profession | state | asOfDate | temperature | pulse |
---|---|---|---|---|---|---|
Rachel Green | Female | Designer | NY | 2020-11-01 | 98.4 | 60 |
Sheldon Cooper | Male | Physicist | CA | 2020-11-05 | 98.5 | 70 |
Penny | Female | Waitress | CA | 2020-11-15 | 99.2 | 75 |
George Costanza | Male | Agent | NJ | 2020-05-05 | 98.7 | 90 |
Jerry Seinfeld | Male | Comedian | NY | 2020-01-01 | 98.6 | 65 |
数据库、维度和表格创建
作者主要是创建一个sample数据库,和三个模式表(landing着陆?、dim维度表和fact事实表),以及与前两者相关的众多的表格。
## create database
postgres=> create database sample
postgres=> \c sample
## it will prompt you to enter Postgres login password
## create schemas
sample=> create schema landing;
sample=> create schema dim;
sample=> create schema fact;
## create tables
sample=> create table landing.rawdata (
name varchar(100)
,gender varchar(20)
,profession varchar(50)
,state varchar(2)
,asofdate date
,temperature float
,pulse int
);
## Import CSV data into Postgres table
sample=> \COPY landing.rawdata FROM '/yourpath/sampleData.csv' DELIMITER ',' CSV HEADER;
## Add a surrogate ID column (surrogate column means, column with sequence of numbers, generally auto generated).
sample=> alter table landing.rawdata add id serial;
## create dimension tables
## Fisrt Query
create table dim.gender as
select
row_number() Over(order by gender) as genderid
,gender
from
(select distinct gender from landing.rawdata) t;
## Second Query
create table dim.person as
select
row_number() Over(order by name) as personid
,name
from
(select distinct name from landing.rawdata) t;
## Third Query
create table dim.profession as
select
row_number() Over(order by profession) as professionid
,profession
from
(select distinct profession from landing.rawdata) t;
## Fourth Query
create table dim.state as
select
row_number() Over(order by state) as stateid
,state
from
(select distinct state from landing.rawdata) t;
## Build Fact table based on IDs from Dimension Table.
create table fact.user
as
select
r.id
,p.personid
,g.genderid
,pr.professionID
,s.stateID
,r.asofdate
,r.temperature
,r.pulse
from
landing.rawdata r
JOIN dim.person as p on r.name = p.name
JOIN dim.gender as g on r.gender = g.gender
JOIN dim.profession as pr on r.profession = pr.profession
JOIN dim.state as s on r.state = s.state;
个人理解
这里基于我贫弱的数据库理解去试图解释一下为什么要这么做,它的目的是什么。
做了什么(星座模型)?
sample database | content |
---|---|
landing | 创建一个landing.rawdata储存全部的数据 |
dim | 创建各个维度表:- dim.person- dim.gender- dim.profession- dim.state |
fact | 创建事实表来分别关联到各个维度表,例如创建一个user事实表可以这样:- fact.user存储- - landing.rawdata的id- - dim.gender的gender代码- - dim.profession的profession代码- - dim.state的state代码 |
为什么这么做?
优势:
- 查询效率高
- 维度多,数据解耦
- 关系明朗
- 用户友好
劣势:
- 表格繁多
- 数据冗余多
关系可视化
参考资料
{% referfrom [1], [How to create Fact and Dimension tables from denormalized raw data], [https://medium.com/analytics-vidhya/how-to-create-fact-and-dimension-tables-from-denormalized-raw-data-e26127df2249] %}