Terraform自动化创建BigQuery数据仓库

作者:MeshCloud脉时云公有云架构师 于文宝

部署和使用使用 BigQuery 创建数据仓库快速启动解决方案。本解决方案演示如何使用 BigQuery 作为数据仓库,使用 Looker Studio 作为信息中心和可视化工具在 Google Cloud 中构建数据仓库。

构建数据仓库的常见使用情形包括:

  • 汇总和创建营销分析仓库,以提高收入或其他客户指标。
  • 构建财务报告和分析。
  • 构建运营信息中心以提高公司绩效。

目标

  • 了解数据如何流入云数据仓库,以及如何使用 SQL 转换数据。
  • 根据数据构建信息中心以执行数据分析。
  • 安排 SQL 语句以定期更新数据。
  • 创建机器学习模型以预测一段时间内的数据值。

使用的产品

此解决方案使用以下 Google Cloud 产品:

  • BigQuery:伸缩能力极强的全代管式数据仓库,内置有机器学习功能。
  • Cloud Storage:一种企业级服务,为各种数据类型提供费用低廉、不受限制的对象存储。可从 Google Cloud 内部和外部访问数据,并以地理位置冗余方式复制数据。
  • Looker Studio:自助式商业智能平台,可帮助您创建和共享数据洞见。

以下 Google Cloud 产品用于暂存解决方案中首次使用的数据:

  • Cloud Functions:一种轻量级无服务器计算服务,可让您创建单一用途的独立函数来响应 Google Cloud 事件,而无需管理服务器或运行时环境。
  • Pub/Sub:一种异步且可伸缩的通讯服务,可将生成消息的服务与处理这些消息的服务分离开。
  • Eventarc:该服务可管理分离微服务之间的状态更改(事件)流,同时管理传送、安全、授权、可观测性和错误处理。
  • Workflows:一个全代管式编排平台,该平台按指定顺序以工作流的形式执行服务。Workflows 可以组合各种服务,包括 Cloud Run 或 Cloud Functions 上托管的自定义服务、Google Cloud 服务(如 BigQuery)以及任何基于 HTTP 的 API。

架构图

测试demo

此解决方案部署的示例仓库分析纽约市出租车数据,以了解出租车公司在一段时间内的业绩。下图展示了该解决方案部署的 Google Cloud 资源的架构。

Terraform自动化创建BigQuery数据仓库

解决方案流程

该架构表示用于填充和转换数据仓库数据的常见数据流动:

  1. 数据发送到 Cloud Storage 存储桶。
  2. Cloud Functions 辅助数据移动。
  3. 使用 SQL 存储过程将数据从外部表加载到 BigQuery 中。
  4. 使用 SQL 存储过程在 BigQuery 中转换数据。
  5. 根据数据创建信息中心,以使用 Looker Studio 执行更多分析。

获取所需的 IAM 权限

如需开始部署过程,您需要下表中列出的 Identity and Access Management (IAM) 权限。如果您拥有计划在其中部署解决方案的项目的 roles/owner 基本角色,则您已经拥有所有必要的权限。如果您没有 roles/owner 角色,请让管理员授予您这些权限(或包含这些权限的角色)。

所需的 IAM 权限包含所需权限的预定义角色
serviceusage.services.enableService Usage Admin
(roles/serviceusage.serviceUsageAdmin)
iam.serviceAccounts.createService Account Admin
(roles/iam.serviceAccountAdmin)
resourcemanager.projects.setIamPolicyProject IAM Admin
(roles/resourcemanager.projectIamAdmin)
config.deployments.create
config.deployments.list
Cloud Config Manager Admin
(roles/config.admin)

使用 Terraform:如果要自定义解决方案,或者要使用基础架构即代码 (IaC) 方法自动预配和管理资源,请使用此方法。从 GitHub 下载 Terraform 配置,根据需要可以选择自定义代码,然后使用 Terraform 部署解决方案。部署解决方案后,您可以继续使用 Terraform 来管理该解决方案。

测试代码demo

https://github.com/terraform-google-modules/terraform-google-bigquery.git

部署步骤;

设置 Terraform 客户端 (参考官网)

 查看客户端版本
 terraform -v
 Terraform v1.3.6
on linux_amd64

克隆代码

git clone https://github.com/terraform-google-modules/terraform-google-bigquery.git
ls terraform-google-bigquery 

查看代码配置说明

cd /tmp/terraform-google-bigquery/modules/data_warehouse
README.md  assets  main.tf  outputs.tf  variables.tf  versions.tf
  1. 在同一目录中,创建一个名为 terraform.tfvars 的文本文件。
  2. 在 terraform.tfvars 文件中,复制以下代码段,并为所需的变量设置值。
  • 按照代码段中作为注释提供的说明进行操作。
  • 此代码段仅包含您必须为其设置值的变量。Terraform 配置包括具有默认值的其他变量。如需查看所有变量和默认值,请参阅 $HOME/tmp/terraform-google-bigquery/modules/data_warehouse 目录中提供的 variables.tf 文件。
  • 确保您在 terraform.tfvars 文件中设置的每个值都与 variables.tf 文件中声明的变量类型匹配。例如,如果为 variables.tf 文件中的变量定义的类型为 bool,则必须将 true 或 false 指定为该变量在 terraform.tfvars 文件中的值。
# This is an example of the terraform.tfvars file.
# The values in this file must match the variable types declared in variables.tf.
# The values in this file override any defaults in variables.tf.

# ID of the project in which you want to deploy the solution
project_id = "PROJECT_ID"

# Google Cloud region where you want to deploy the solution
# Example: us-central1
region = "REGION"

# Whether or not to enable underlying apis in this solution.
# Example: true
enable_apis = true

# Whether or not to protect BigQuery resources from deletion when solution is modified or changed.
# Example: false
force_destroy = false

# Whether or not to protect Cloud Storage resources from deletion when solution is modified or changed.
# Example: true
deletion_protection = true

运行以下命令初始化 Terraform:

terraform init

等待系统显示以下消息:

Terraform has been successfully initialized!
Terraform自动化创建BigQuery数据仓库

验证并查看 Terraform 配置

确保当前工作目录为 $
HOME/tmp/terraform-google-bigquery/modules/data_warehouse。 如果不是,请转到该目录

terraform validate

如果该命令返回任何错误,请在配置中进行所需的更正,然后再次运行 terraform validate 命令。重复此步骤,直到命令返回以下消息:

Success! The configuration is valid.
Terraform自动化创建BigQuery数据仓库

查看配置中定义的资源:

terraform plan
var.project_id
  Google Cloud Project ID

  Enter a value: mec-test-344202

var.region
  Google Cloud Region

  Enter a value: us-central1
Terraform自动化创建BigQuery数据仓库
Terraform自动化创建BigQuery数据仓库

预配资源

如果 Terraform 配置中不需要进一步更改,请部署资源。

  1. 确保当前工作目录为 $HOME/tmp/terraform-google-bigquery/modules/data_warehouse。 如果不是,请转到该目录。
  2. 应用 Terraform 配置:
terraform apply
  1. 如果您未按照上文所述创建 terraform.tfvars 文件,则 Terraform 会提示您为没有默认值的变量输入值。输入必需的值。
  2. Terraform 会显示将创建的资源列表。
  3. 当系统提示您执行操作时,请输入 yes。
  4. Terraform 展示一些显示部署进度的消息。
Terraform自动化创建BigQuery数据仓库

Terraform 输出还列出了您需要的以下其他信息:

  • 已部署的信息中心的 Looker Studio 网址。
  • 用于打开 BigQuery 编辑器以进行部分示例查询的链接。

以下示例展示了输出内容

bigquery_editor_url = "https://console.cloud.google.com/bigquery?project=mec-test-344202&ws=!1m5!1m4!6m3!1smec-test-344202!2sds_edw!3ssp_sample_queries"
ds_friendly_name = "My EDW Dataset"
function_uri = "https://workflow-initial-23a4d4e7-klfx7irarq-uc.a.run.app"
lookerstudio_report_url = "https://lookerstudio.google.com/reporting/create?c.reportId=402d64d6-2a14-45a1-b159-0dcc88c62cd5&ds.ds0.datasourceName=vw_taxi&ds.ds0.projectId=mec-test-344202&ds.ds0.type=TABLE&ds.ds0.datasetId=ds_edw&ds.ds0.tableId=vw_lookerstudio_report"
neos_tutorial_url = "https://console.cloud.google.com/products/solutions/deployments?walkthrough_id=data-warehouse_tour&project=mec-test-344202"
raw_bucket = "ds-edw-raw-23a4d4e7"

验证创建的服务

  1. 数据发送到 Cloud Storage 存储桶。
Terraform自动化创建BigQuery数据仓库
  1. Cloud Functions 辅助数据移动。
Terraform自动化创建BigQuery数据仓库

3.查看lookerstudio可视化

Terraform自动化创建BigQuery数据仓库
Terraform自动化创建BigQuery数据仓库

Bigquery 数据分析仓库

Terraform自动化创建BigQuery数据仓库
-- Copyright 2023 Google LLC
--
-- Licensed under the Apache License, Version 2.0 (the "License");
-- you may not use this file except in compliance with the License.
-- You may obtain a copy of the License at
--
--      http://www.apache.org/licenses/LICENSE-2.0
--
-- Unless required by applicable law or agreed to in writing, software
-- distributed under the License is distributed on an "AS IS" BASIS,
-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-- See the License for the specific language governing permissions and
-- limitations under the License.

/*
Use Cases:
    - BigQuery supports full SQL syntax and many analytic functions that make complex queries of lots of data easy

Description:
    - Show joins, date functions, rank, partition, pivot

Reference:
    - Rank/Partition: https://cloud.google.com/bigquery/docs/reference/standard-sql/analytic-function-concepts
    - Pivot: https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#pivot_operator

Clean up / Reset script:
    n/a
*/

--Rank, Pivot, Json

-- Query: Get trips over $50 for each day of the week.
-- Shows: Date Functions, Joins, Group By, Having, Ordinal Group/Having
SELECT FORMAT_DATE("%w", Pickup_DateTime) AS WeekdayNumber,
       FORMAT_DATE("%A", Pickup_DateTime) AS WeekdayName,
       vendor.Vendor_Description,
       payment_type.Payment_Type_Description,
       SUM(taxi_trips.Total_Amount) AS high_value_trips
  FROM `mec-test-344202.ds_edw.taxi_trips` AS taxi_trips
       INNER JOIN `mec-test-344202.ds_edw.vendor` AS vendor
               ON cast(taxi_trips.Vendor_Id as INT64) = vendor.Vendor_Id
              AND taxi_trips.Pickup_DateTime BETWEEN '2022-01-01' AND '2022-02-01'
        LEFT JOIN `mec-test-344202.ds_edw.payment_type` AS payment_type
               ON cast(taxi_trips.payment_type as INT64) = payment_type.Payment_Type_Id
GROUP BY 1, 2, 3, 4
HAVING SUM(taxi_trips.Total_Amount) > 50
ORDER BY WeekdayNumber, 3, 4;

-- Query: amounts (Cash/Credit) by passenger type
WITH TaxiDataRanking AS
(
SELECT CAST(Pickup_DateTime AS DATE) AS Pickup_Date,
       cast(taxi_trips.payment_type as INT64) as Payment_Type_Id,
       taxi_trips.Passenger_Count,
       taxi_trips.Total_Amount,
       RANK() OVER (PARTITION BY CAST(Pickup_DateTime AS DATE),
                                 taxi_trips.payment_type
                        ORDER BY taxi_trips.Passenger_Count DESC,
                                 taxi_trips.Total_Amount DESC) AS Ranking
  FROM `mec-test-344202.ds_edw.taxi_trips` AS taxi_trips
 WHERE taxi_trips.Pickup_DateTime BETWEEN '2022-01-01' AND '2022-02-01'
   AND cast(taxi_trips.payment_type as INT64) IN (1,2)
)
SELECT Pickup_Date,
       Payment_Type_Description,
       Passenger_Count,
       Total_Amount
  FROM TaxiDataRanking
       INNER JOIN `mec-test-344202.ds_edw.payment_type` AS payment_type
               ON TaxiDataRanking.Payment_Type_Id = payment_type.Payment_Type_Id
WHERE Ranking = 1
ORDER BY Pickup_Date, Payment_Type_Description;

-- Query: data summed by payment type and passenger count, then pivoted based upon payment type
WITH MonthlyData AS
(
SELECT FORMAT_DATE("%B", taxi_trips.Pickup_DateTime) AS MonthName,
       FORMAT_DATE("%m", taxi_trips.Pickup_DateTime) AS MonthNumber,
       CASE WHEN cast(taxi_trips.payment_type as INT64) = 1 THEN 'Credit'
            WHEN cast(taxi_trips.payment_type as INT64) = 2 THEN 'Cash'
            WHEN cast(taxi_trips.payment_type as INT64) = 3 THEN 'NoCharge'
            WHEN cast(taxi_trips.payment_type as INT64) = 4 THEN 'Dispute'
         END AS PaymentDescription,
       taxi_trips.Passenger_Count,
       taxi_trips.Total_Amount
  FROM `mec-test-344202.ds_edw.taxi_trips` AS taxi_trips
 WHERE taxi_trips.Pickup_DateTime BETWEEN '2022-01-01' AND '2022-02-01'
   AND Passenger_Count IS NOT NULL
   AND cast(payment_type as INT64) IN (1,2,3,4)
)
SELECT MonthName,
       Passenger_Count,
       FORMAT("%'d", CAST(Credit   AS INTEGER)) AS Credit,
       FORMAT("%'d", CAST(Cash     AS INTEGER)) AS Cash,
       FORMAT("%'d", CAST(NoCharge AS INTEGER)) AS NoCharge,
       FORMAT("%'d", CAST(Dispute  AS INTEGER)) AS Dispute
  FROM MonthlyData
 PIVOT(SUM(Total_Amount) FOR PaymentDescription IN ('Credit', 'Cash', 'NoCharge', 'Dispute'))
ORDER BY MonthNumber, Passenger_Count;

-- Query: data pivoted by payment type
WITH MonthlyData AS
(
SELECT FORMAT_DATE("%B", taxi_trips.Pickup_DateTime) AS MonthName,
       FORMAT_DATE("%m", taxi_trips.Pickup_DateTime) AS MonthNumber,
       CASE WHEN cast(taxi_trips.payment_type as INT64) = 1 THEN 'Credit'
            WHEN cast(taxi_trips.payment_type as INT64) = 2 THEN 'Cash'
            WHEN cast(taxi_trips.payment_type as INT64) = 3 THEN 'NoCharge'
            WHEN cast(taxi_trips.payment_type as INT64) = 4 THEN 'Dispute'
         END AS PaymentDescription,
       SUM(taxi_trips.Total_Amount) AS Total_Amount
  FROM `mec-test-344202.ds_edw.taxi_trips` AS taxi_trips
 WHERE taxi_trips.Pickup_DateTime BETWEEN '2022-01-01' AND '2022-02-01'
   AND Passenger_Count IS NOT NULL
   AND cast(taxi_trips.payment_type as INT64) IN (1,2,3,4)
 GROUP BY 1, 2, 3
)
SELECT MonthName,
       FORMAT("%'d", CAST(Credit   AS INTEGER)) AS Credit,
       FORMAT("%'d", CAST(Cash     AS INTEGER)) AS Cash,
       FORMAT("%'d", CAST(NoCharge AS INTEGER)) AS NoCharge,
       FORMAT("%'d", CAST(Dispute  AS INTEGER)) AS Dispute
  FROM MonthlyData
 PIVOT(SUM(Total_Amount) FOR PaymentDescription IN ('Credit', 'Cash', 'NoCharge', 'Dispute'))
ORDER BY MonthNumber;

-- Query: See what day of the week in each month has the greatest amount (that's the month/day to work)
WITH WeekdayData AS
(
SELECT FORMAT_DATE("%B", Pickup_DateTime) AS MonthName,
       FORMAT_DATE("%m", Pickup_DateTime) AS MonthNumber,
       FORMAT_DATE("%A", Pickup_DateTime) AS WeekdayName,
       SUM(taxi_trips.Total_Amount) AS Total_Amount
  FROM `mec-test-344202.ds_edw.taxi_trips` AS taxi_trips
 WHERE taxi_trips.Pickup_DateTime BETWEEN '2022-01-01' AND '2022-02-01'
   AND cast(taxi_trips.payment_type as INT64) IN (1,2,3,4)
 GROUP BY 1, 2, 3
)
SELECT MonthName,
       FORMAT("%'d", CAST(Sunday    AS INTEGER)) AS Sunday,
       FORMAT("%'d", CAST(Monday    AS INTEGER)) AS Monday,
       FORMAT("%'d", CAST(Tuesday   AS INTEGER)) AS Tuesday,
       FORMAT("%'d", CAST(Wednesday AS INTEGER)) AS Wednesday,
       FORMAT("%'d", CAST(Thursday  AS INTEGER)) AS Thursday,
       FORMAT("%'d", CAST(Friday    AS INTEGER)) AS Friday,
       FORMAT("%'d", CAST(Saturday  AS INTEGER)) AS Saturday,
  FROM WeekdayData
 PIVOT(SUM(Total_Amount) FOR WeekdayName IN ('Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday'))
ORDER BY MonthNumber;

数据仓库删除部署资源

terraform destroy
Terraform自动化创建BigQuery数据仓库

代码地址:
https://github.com/terraform-google-modules/terraform-google-bigquery/tree/645e382bfe830c3111f5fa6cbe7bae8f87e691f4

参考文档:
https://cloud.google.com/architecture/big-data-analytics/data-warehouse?hl=zh_CN&_ga=2.185801642.-2097802361.1683768839#delete-deployment

发表评论

您的电子邮箱地址不会被公开。