如何构建事实表和多维表how to create fact and dimensions tables

2023年10月26日 · 736 字 · 2 分钟 · Database

本文转载翻译自{% referto [1], [How to create Fact and Dimension tables from denormalized raw data] %}。

如何基于非规范化的原始数据创建事实表和维度表

样本数据预览

NameGenderProfessionstateasOfDatetemperaturepulse
Rachel GreenFemaleDesignerNY2020-11-0198.460
Sheldon CooperMalePhysicistCA2020-11-0598.570
PennyFemaleWaitressCA2020-11-1599.275
George CostanzaMaleAgentNJ2020-05-0598.790
Jerry SeinfeldMaleComedianNY2020-01-0198.665

数据库、维度和表格创建

作者主要是创建一个sample数据库,和三个模式表(landing着陆?、dim维度表和fact事实表),以及与前两者相关的众多的表格。

## create database
postgres=> create database sample
postgres=> \c sample
## it will prompt you to enter Postgres login password
## create schemas
sample=> create schema landing;
sample=> create schema dim;
sample=> create schema fact;
## create tables
sample=> create table landing.rawdata (
    name varchar(100)
    ,gender varchar(20)
    ,profession varchar(50)
    ,state varchar(2)
    ,asofdate date
    ,temperature float
    ,pulse int
);
## Import CSV data into Postgres table
sample=> \COPY landing.rawdata FROM '/yourpath/sampleData.csv' DELIMITER ',' CSV HEADER;
## Add a surrogate ID column (surrogate column means, column with sequence of numbers, generally auto generated).
sample=> alter table landing.rawdata add id serial;
## create dimension tables
## Fisrt Query
create table dim.gender as 
select 
    row_number() Over(order by gender) as genderid
    ,gender 
from 
    (select distinct gender from landing.rawdata) t;
## Second Query
create table dim.person as
select 
    row_number() Over(order by name) as personid
   ,name 
from 
    (select distinct name from landing.rawdata) t;
## Third Query
create table dim.profession as
select 
    row_number() Over(order by profession) as professionid
    ,profession 
from 
    (select distinct profession from landing.rawdata) t;
## Fourth Query
create table dim.state as 
select 
    row_number() Over(order by state) as stateid
    ,state 
from 
    (select distinct state from landing.rawdata) t;   
## Build Fact table based on IDs from Dimension Table.
create table fact.user
as
select
    r.id
    ,p.personid
    ,g.genderid
    ,pr.professionID
    ,s.stateID
    ,r.asofdate
    ,r.temperature
    ,r.pulse
from
    landing.rawdata r
    JOIN dim.person as p on r.name = p.name
    JOIN dim.gender as g on r.gender = g.gender
    JOIN dim.profession as pr on r.profession = pr.profession
    JOIN dim.state as s on r.state = s.state;

个人理解

这里基于我贫弱的数据库理解去试图解释一下为什么要这么做,它的目的是什么。

做了什么(星座模型)?

sample databasecontent
landing创建一个landing.rawdata储存全部的数据
dim创建各个维度表:- dim.person- dim.gender- dim.profession- dim.state
fact创建事实表来分别关联到各个维度表,例如创建一个user事实表可以这样:- fact.user存储- - landing.rawdata的id- - dim.gender的gender代码- - dim.profession的profession代码- - dim.state的state代码

为什么这么做?

优势

  • 查询效率高
  • 维度多,数据解耦
  • 关系明朗
  • 用户友好

劣势:

  • 表格繁多
  • 数据冗余多

关系可视化

mermaid-diagram-2023-02-14-154702

mermaid图地址

参考资料

{% referfrom [1], [How to create Fact and Dimension tables from denormalized raw data], [https://medium.com/analytics-vidhya/how-to-create-fact-and-dimension-tables-from-denormalized-raw-data-e26127df2249] %}