This repository is adapted from the [Speech Analysis Framework](https://github.com/GoogleCloudPlatform/dataflow-contact-center-speech-analysis).
It can:
- Process JSON files in a GCS bucket
- Send the text in the JSON files to the Cloud Natural Language API
- Write the output to BigQuery
The process is as follows:
- JSON files are uploaded to a GCS bucket
- The Cloud Dataflow job picks up the files and sends them to the Cloud Natural Language API one by one
- The Cloud Dataflow job writes the results to BigQuery
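For orientation, the sketch below shows the general shape of such a pipeline in Apache Beam. It is not the actual `gcs-nlp-bq-batch.py`; the field names (`transcript`, `sentimentscore`, `magnitude`), the one-JSON-object-per-line input format, and the per-element client are simplifying assumptions.

```python
# Minimal sketch of the pipeline shape; the real logic lives in
# gcs-nlp-bq-batch.py.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import language_v1


def analyze(line):
    """Parse one JSON record and attach sentiment from the NL API."""
    record = json.loads(line)
    # A production DoFn would create the client once in setup();
    # it is created per element here only for brevity.
    client = language_v1.LanguageServiceClient()
    doc = language_v1.Document(
        content=record["transcript"],
        type_=language_v1.Document.Type.PLAIN_TEXT,
    )
    sentiment = client.analyze_sentiment(
        request={"document": doc}
    ).document_sentiment
    record["sentimentscore"] = sentiment.score
    record["magnitude"] = sentiment.magnitude
    return record


with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        | "Read JSON" >> beam.io.ReadFromText("gs://[YOUR_BUCKET]/*.json")
        | "Analyze" >> beam.Map(analyze)
        # Assumes the destination table already exists with a matching
        # schema; the deployed job handles table creation itself.
        | "Write" >> beam.io.WriteToBigQuery("[DATASET_NAME].[TABLE]")
    )
```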
- Create a storage bucket for the Dataflow staging files
  ```sh
  gsutil mb gs://[BUCKET_NAME]/
  ```
- Through the Google Cloud Console, create a folder named `tmp` in the newly created bucket for the Dataflow staging files
- Create a storage bucket for the uploaded JSON files
  ```sh
  gsutil mb gs://[BUCKET_NAME]/
  ```
- Create a BigQuery dataset
  ```sh
  bq mk [YOUR_BIGQUERY_DATASET_NAME]
  ```
- Enable the Cloud Dataflow API
  ```sh
  gcloud services enable dataflow.googleapis.com
  ```
- Enable the Cloud Natural Language API
  ```sh
  gcloud services enable language.googleapis.com
  ```
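- Optionally, sanity-check the Natural Language API before deploying. Below is a minimal one-off call, assuming the `google-cloud-language` package is installed and application-default credentials are configured; the sample sentence is arbitrary.
  ```python
  # One-off sentiment call to confirm the API is enabled and reachable.
  from google.cloud import language_v1

  client = language_v1.LanguageServiceClient()
  doc = language_v1.Document(
      content="The agent resolved my issue quickly.",
      type_=language_v1.Document.Type.PLAIN_TEXT,
  )
  sentiment = client.analyze_sentiment(request={"document": doc}).document_sentiment
  print(f"score={sentiment.score:.2f} magnitude={sentiment.magnitude:.2f}")
  ```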
- Deploy the Cloud Dataflow pipeline: in the cloned repo, go to the `saf-longrun-job-dataflow` directory, set up a Python environment with the commands below, then launch the job with one of the run commands that follow.
  ```sh
  # macOS/Linux
  python -m venv env
  source env/bin/activate
  pip install "apache-beam[gcp]"
  pip install Cython
  ```
  or
  ```sh
  # Windows
  python -m venv env
  env\Scripts\activate
  pip install "apache-beam[gcp]"
  pip install Cython
  ```
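  Optionally, verify the SDK installed correctly inside the environment (a quick check; the printed version will vary):
  ```python
  # Confirm the Beam SDK is importable from the active environment.
  import apache_beam
  print(apache_beam.__version__)
  ```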
- The Dataflow job will create the BigQuery table you listed in the parameters.
- Please wait, as it might take a few minutes to complete.
- Running with the local runner
  ```sh
  python3 gcs-nlp-bq-batch.py --project=[YOUR_PROJECT_ID] \
    --input=gs://[YOUR_BUCKET]/*.json \
    --output_bigquery=[DATASET_NAME].[TABLE] \
    --requirements_file="requirements.txt"
  ```
- Running with the Dataflow runner
  ```sh
  python3 gcs-nlp-bq-batch.py --project=[YOUR_PROJECT_ID] --runner=DataflowRunner \
    --input=gs://[YOUR_BUCKET]/*.json \
    --temp_location=gs://[YOUR_DATAFLOW_STAGING_BUCKET]/tmp \
    --output_bigquery=[DATASET_NAME].[TABLE] \
    --requirements_file="requirements.txt"
  ```
- After a few minutes you will be able to see the data in BigQuery.
- Sample SELECT statements that can be executed in the BigQuery console:
  ```sql
  -- Order Natural Language entities for all records
  SELECT
    *
  FROM (
    SELECT
      entities.name,
      entities.type,
      COUNT(entities.name) AS count
    FROM
      `[YOUR_PROJECT_ID].[YOUR_DATASET].[YOUR_TABLE]`,
      UNNEST(entities) entities
    GROUP BY
      entities.name,
      entities.type
    ORDER BY
      count DESC )
  ```
  ```sql
  -- Search transcripts with a regular expression
  SELECT
    transcript,
    sentimentscore,
    magnitude
  FROM
    `[YOUR_PROJECT_ID].[YOUR_DATASET].[YOUR_TABLE]`
  WHERE
    REGEXP_CONTAINS(transcript, '(?i) [YOUR_WORD]')
  ```
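- The same queries can also be run programmatically. Below is a minimal sketch using the BigQuery Python client, assuming the `google-cloud-bigquery` package is installed; the table name placeholder and the `sentimentscore` column mirror the queries above.
  ```python
  # Fetch the ten lowest-sentiment transcripts from the output table.
  from google.cloud import bigquery

  client = bigquery.Client(project="[YOUR_PROJECT_ID]")
  query = """
      SELECT transcript, sentimentscore, magnitude
      FROM `[YOUR_PROJECT_ID].[YOUR_DATASET].[YOUR_TABLE]`
      ORDER BY sentimentscore ASC
      LIMIT 10
  """
  for row in client.query(query).result():
      print(row.sentimentscore, row.magnitude, row.transcript)
  ```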