
Commit 5020fb5

pkattamuri authored and Jacob Ferriero committed
Feature/hive bigquery (#139)
* Code for migrating Hive to BigQuery added
* code added
* Readme file added
* File renamed
* optional arguments added
* pylint enhancements
* Formatted with pylint
* Changes in readme, added script to generate data
* Changes in readme, added script to generate data
* consistent string formatting, narrow exceptions
* Changes in readme, added script to generate data
* Changes in readme, added script to generate data
* consistent string formatting, narrow exceptions
* More details on cloud-sql-proxy
* Cloud KMS, config file, handled long hive table names
* removed unnecessary comments
* changed pip to pip3
* modified prereq, readme
1 parent 6897968 commit 5020fb5

29 files changed, +4405 -0 lines changed

README.md

Lines changed: 1 addition & 0 deletions
@@ -46,6 +46,7 @@ This script helps you create a Cloudera parcel that includes Google Cloud Storag
* [GCS Bucket Mover](tools/gcs-bucket-mover) - A tool to move user's bucket, including objects, metadata, and ACL, from one project to another.
* [GKE Billing Export](tools/gke-billing-export) - Google Kubernetes Engine fine grained billing export.
* [GSuite Exporter](tools/gsuite-exporter/) - A Python package that automates syncing Admin SDK APIs activity reports to a GCP destination. The module takes entries from the chosen Admin SDK API, converts them into the appropriate format for the destination, and exports them to a destination (e.g: Stackdriver Logging).
* [Hive to BigQuery](tools/hive-bigquery/) - A Python framework to migrate Hive table to BigQuery using Cloud SQL to keep track of the migration progress.
* [LabelMaker](tools/labelmaker) - A tool that reads key:value pairs from a json file and labels the running instance and all attached drives accordingly.
* [Maven Archetype Dataflow](tools/maven-archetype-dataflow) - A maven archetype which bootstraps a Dataflow project with common plugins pre-configured to help maintain high code quality.
* [Netblock Monitor](tools/netblock-monitor) - An Apps Script project that will automatically provide email notifications when changes are made to Google’s IP ranges.

tools/hive-bigquery/README.md

Lines changed: 145 additions & 0 deletions
@@ -0,0 +1,145 @@
# Migrate Hive tables to BigQuery

This framework migrates data from Hive to BigQuery using Cloud SQL to keep track of the migration progress. It is designed to handle both a one-time bulk load and incremental loads. Bulk loading is appropriate when the data under the Hive table is stable and all of it can be migrated at once. Incremental loading is appropriate when data is continuously appended to the existing Hive table and subsequent migration runs should move only the new data.

By default, this framework is expected to run on the source Hive cluster.
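
For intuition only: on an incremental run, the framework needs to move just the rows added since the value it last recorded in Cloud SQL for the chosen incremental column. The sketch below is purely illustrative and is not how the tool is invoked; `my_table` and `LAST_MIGRATED_VALUE` are made-up names for this example.
```
# Illustration only, not part of the tool's interface: a later incremental run
# conceptually selects rows beyond the recorded high-water mark, e.g. for an
# integer incremental column. my_table and LAST_MIGRATED_VALUE are hypothetical.
hive -e "SELECT * FROM my_table WHERE int_column > ${LAST_MIGRATED_VALUE}"
```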

# Architectural Diagram

Cloud SQL keeps track of:

1. the data in the source Hive table.
2. the files which are copied from the Hive cluster to Cloud Storage.
3. the files which are loaded from Cloud Storage into BigQuery.

![Alt text](architectural_diagram.png?raw=true)

# Before you begin

1. Install the gcloud SDK on the source Hive cluster by following these [instructions](https://cloud.google.com/sdk/install).
2. [Create a service account](https://cloud.google.com/iam/docs/creating-managing-service-accounts#creating_a_service_account) and grant the following roles to the service account by following these [instructions](https://cloud.google.com/iam/docs/granting-roles-to-service-accounts#granting_access_to_a_service_account_for_a_resource).
    1. Google Cloud Storage - storage.admin
    2. BigQuery - bigquery.dataEditor and bigquery.jobUser
    3. Cloud SQL - cloudsql.admin
    4. Cloud KMS - cloudkms.cryptoKeyEncrypterDecrypter
3. [Download](https://cloud.google.com/iam/docs/creating-managing-service-account-keys#creating_service_account_keys) the service account key file and set the environment variable by following these [instructions](https://cloud.google.com/docs/authentication/getting-started#setting_the_environment_variable).
```
export GOOGLE_APPLICATION_CREDENTIALS="[PATH_TO_KEY_FILE]"
```
Note: Saving credentials in environment variables is convenient but not secure; consider a more secure solution such as [Cloud KMS](https://cloud.google.com/kms/) to help keep secrets safe.
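
Note that the gcloud CLI itself does not read GOOGLE_APPLICATION_CREDENTIALS; it keeps its own credential store. If you want the gcloud commands in the next section to run as this service account rather than your user account, you can optionally activate it. This step is not part of the original instructions:
```
# Optional, not part of the original steps: make the gcloud CLI use the same
# service account key that GOOGLE_APPLICATION_CREDENTIALS points to.
gcloud auth activate-service-account --key-file="$GOOGLE_APPLICATION_CREDENTIALS"
```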

# Store Hive password using Cloud KMS

1. Create a Cloud KMS key ring and a key. Refer to the
[documentation](https://cloud.google.com/kms/docs/creating-keys#top_of_page) for more instructions.
```
# Create a key ring
gcloud kms keyrings create [KEYRING_NAME] --location [LOCATION]

# Create a key
gcloud kms keys create [KEY_NAME] --location [LOCATION] \
  --keyring [KEYRING_NAME] --purpose encryption
```
2. On your local machine, create a file, for example `password.txt`, that
contains the Hive password.
3. Encrypt the file `password.txt` using the key and key ring created above.
```
gcloud kms encrypt \
  --location=[LOCATION] \
  --keyring=[KEY_RING] \
  --key=[KEY] \
  --plaintext-file=password.txt \
  --ciphertext-file=password.txt.enc
```
4. Upload the encrypted file, `password.txt.enc`, to a GCS bucket. Note this
file location, which will be provided later as an input to the migration tool.
```
gsutil cp password.txt.enc gs://<BUCKET_NAME>/<OBJECT_PATH>
```
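
Optionally, before deleting the plaintext file in the next step, you can confirm that the encrypted copy decrypts back to the original password. This check is not part of the original steps:
```
# Optional sanity check, not part of the original steps: decrypt the ciphertext
# to stdout and compare it with the contents of password.txt.
gcloud kms decrypt \
  --location=[LOCATION] \
  --keyring=[KEY_RING] \
  --key=[KEY] \
  --ciphertext-file=password.txt.enc \
  --plaintext-file=-
```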
5. Delete the plaintext `password.txt` file from the local machine.
# Usage
1. Clone the repo.
```
git clone https://github.com/GoogleCloudPlatform/professional-services.git
cd professional-services/tools/hive-bigquery/
```
2. Install prerequisites such as python3, pip3, virtualenv and Cloud SQL proxy.
```
sudo bash prerequisites/prerequisites.sh
```
3. Create and activate a virtual environment, then install the Python dependencies.
```
virtualenv env
source env/bin/activate
pip3 install -r prerequisites/requirements.txt
```
4. The command below creates a Cloud SQL MySQL database instance and a database
for storing the tracking information. If you want to use an existing instance,
create the required database separately.

Set the root password when the script prompts for it. The script outputs the
connection name of the instance, which is used in the next steps.
```
sh prerequisites/create_sql_instance.sh <INSTANCE_NAME> <DATABASE_NAME> <OPTIONAL_GCP_REGION>
```
5. Start the Cloud SQL proxy by providing the instance connection name obtained
from the previous step and a TCP port (generally 3306 is used) on which the
connection will be established.
```
/usr/local/bin/cloud_sql_proxy -instances=<INSTANCE_CONNECTION_NAME>=tcp:<PORT> &
```
6. Verify you are able to connect to the Cloud SQL database by running the
command below. Provide the port you used in the previous step and the password
for the root user that you set in step 4.
```
mysql -h 127.0.0.1 -u root -P <PORT> -p
```
7. Create the tracking_table_info metatable in your MySQL database by importing
the [prerequisites/tracking_table.sql](prerequisites/tracking_table.sql) file.
This table will contain information about the migrated Hive tables and their properties.
```
mysql -h 127.0.0.1 -u root -P <PORT> <DATABASE_NAME> -p < prerequisites/tracking_table.sql
```
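
Optionally, confirm that the metatable was created. The exact columns come from [prerequisites/tracking_table.sql](prerequisites/tracking_table.sql) and are not listed here; this check is not part of the original steps:
```
# Optional check, not part of the original steps: list the columns of the
# freshly imported tracking table.
mysql -h 127.0.0.1 -u root -P <PORT> <DATABASE_NAME> -p -e "DESCRIBE tracking_table_info;"
```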
8. Usage:
```
usage: hive_to_bigquery.py [-h] --config-file CONFIG_FILE

Framework to migrate Hive tables to BigQuery which uses Cloud SQL to keep
track of the migration progress.

optional arguments:
  -h, --help            show this help message and exit

required arguments:
  --config-file CONFIG_FILE
                        Input configurations JSON file.
```
9. Run [hive_to_bigquery.py](hive_to_bigquery.py).
```
python3 hive_to_bigquery.py \
  --config-file <CONFIG_FILE>
```
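
After a run completes, you can spot-check the destination table with the `bq` CLI. This is not part of the original steps, and the `<PROJECT_ID>`, `<DATASET>` and `<TABLE_NAME>` placeholders below stand for whatever you set in your JSON configuration file:
```
# Optional spot check, not part of the original steps: confirm the BigQuery
# table exists and compare its row count against the source Hive table.
bq show <PROJECT_ID>:<DATASET>.<TABLE_NAME>
bq query --use_legacy_sql=false \
  'SELECT COUNT(*) FROM `<PROJECT_ID>.<DATASET>.<TABLE_NAME>`'
```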

# Test Run

It is recommended to perform a test run before migrating your actual Hive
tables. To do so, you can use [generate_data.py](test/generate_data.py) to
randomly generate data of a specified size and use
[create_hive_tables.sql](test/create_hive_tables.sql) to create Hive tables in
the default database on the source Hive cluster, both non-partitioned and
partitioned, in different formats.

On the Hive cluster, run the command below to generate ~1GB of data.
```
python3 test/generate_data.py --size-in-gb 1
```
Run the command below to create the Hive tables on the Hive cluster.
```
hive -f test/create_hive_tables.sql
```
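
Optionally, verify that the test tables were created and contain the generated rows; `text_nonpartitioned` is the table referenced by the sample configuration below. This check is not part of the original steps:
```
# Optional check, not part of the original steps: count the rows in one of the
# generated test tables in the default Hive database.
hive -e "SELECT COUNT(*) FROM text_nonpartitioned;"
```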
The config file `test_config.json` can be used to migrate the Hive table
`text_nonpartitioned`, which has an incremental column `int_column`. Replace the
other parameters with appropriate values. Run the command below to migrate
this table.
```
python3 hive_to_bigquery.py --config-file test_config.json
```