NLP Process Pipeline

This repository is adapted from the [Speech Analysis Framework](https://github.com/GoogleCloudPlatform/dataflow-contact-center-speech-analysis).

The pipeline can:

  • Process JSON files in a GCS bucket.
  • Send the text in each JSON file to the Cloud Natural Language API.
  • Write the output to BigQuery.

The process is as follows (a minimal pipeline sketch follows the list):

  1. JSON files are uploaded to a GCS bucket
  2. The Cloud Dataflow job picks up the files and sends the text to the Cloud Natural Language API, one record at a time
  3. The Cloud Dataflow job writes the results to BigQuery
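
The repository's gcs-nlp-bq-batch.py implements this flow as an Apache Beam pipeline. The sketch below only illustrates the same three stages and is not the repo's actual code: the input field name ("transcript"), the assumption of one JSON object per line, and the BigQuery schema are all hypothetical.

# sketch.py -- illustrative only; field names and schema are assumptions
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class AnnotateWithNlApi(beam.DoFn):
    """Sends each record's text to the Cloud Natural Language API."""

    def process(self, record):
        # Import inside the DoFn so Dataflow workers resolve the dependency.
        from google.cloud import language_v1
        client = language_v1.LanguageServiceClient()
        document = language_v1.Document(
            content=record["transcript"],  # hypothetical input field
            type_=language_v1.Document.Type.PLAIN_TEXT,
        )
        sentiment = client.analyze_sentiment(
            request={"document": document}).document_sentiment
        yield {
            "transcript": record["transcript"],
            "sentimentscore": sentiment.score,
            "magnitude": sentiment.magnitude,
        }


def run():
    with beam.Pipeline(options=PipelineOptions()) as p:
        (
            p
            | "ReadJson" >> beam.io.ReadFromText("gs://[YOUR BUCKET]/*.json")
            | "Parse" >> beam.Map(json.loads)
            | "Annotate" >> beam.ParDo(AnnotateWithNlApi())
            | "WriteToBq" >> beam.io.WriteToBigQuery(
                "[DATASET NAME].[TABLE]",
                schema="transcript:STRING,sentimentscore:FLOAT,magnitude:FLOAT",
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )


if __name__ == "__main__":
    run()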


How to install the NLP Process Pipeline

  1. Install the Google Cloud SDK

  2. Create a storage bucket for the Dataflow staging files

gsutil mb gs://[BUCKET_NAME]/
  3. Through the Google Cloud Console, create a folder named tmp in the newly created bucket for the Dataflow staging files

  4. Create a storage bucket for the uploaded JSON files

gsutil mb gs://[BUCKET_NAME]/
  5. Create a BigQuery dataset
bq mk [YOUR_BIG_QUERY_DATABASE_NAME]
  6. Enable the Cloud Dataflow API
gcloud services enable dataflow.googleapis.com
  7. Enable the Cloud Natural Language API
gcloud services enable language.googleapis.com
  8. Deploy the Cloud Dataflow pipeline
  • In the cloned repo, go to the "saf-longrun-job-dataflow" directory and deploy the Cloud Dataflow pipeline. Run the commands below to set up a Python environment and install the dependencies, then launch the job with one of the runner commands that follow.
# macOS/Linux
python -m venv env
source env/bin/activate
pip install apache-beam[gcp]
pip install Cython

or

# Windows
python -m venv env
env\Scripts\activate
pip install apache-beam[gcp]
pip install Cython
  • The Dataflow job will create the BigQuery table you listed in the parameters.

  • Please wait, as it might take a few minutes to complete.

  • Running with the local runner

python3 gcs-nlp-bq-batch.py --project=[YOUR_PROJECT_ID] \
  --input=gs://[YOUR BUCKET]/*.json \
  --output_bigquery=[DATASET NAME].[TABLE] --requirements_file="requirements.txt"
  • Running with the Dataflow runner
python3 gcs-nlp-bq-batch.py --project=[YOUR_PROJECT_ID] --runner=DataflowRunner \
  --input=gs://[YOUR BUCKET]/*.json \
  --temp_location=gs://[YOUR_DATAFLOW_STAGING_BUCKET]/tmp \
  --output_bigquery=[DATASET NAME].[TABLE] --requirements_file="requirements.txt"
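
If the input bucket is still empty, upload at least one JSON file before launching the job (sample.json below is a hypothetical file name):

gsutil cp sample.json gs://[YOUR BUCKET]/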
  9. After a few minutes, you will be able to see the data in BigQuery.
  • Sample SELECT statements that can be executed in the BigQuery console (a sketch of the underlying Natural Language API calls follows the queries):
-- Count and rank Natural Language entities across all records
SELECT
  *
FROM (
  SELECT
    entities.name,
    entities.type,
    COUNT(entities.name) AS count
  FROM
    `[YOUR_PROJECT_ID].[YOUR_DATASET].[YOUR_TABLE]`,
    UNNEST(entities) entities
  GROUP BY
    entities.name,
    entities.type
  ORDER BY
    count DESC );
-- Search Transcript with a regular expression
SELECT
  transcript,
  sentimentscore,
  magnitude
FROM
  `[YOUR_PROJECT_ID].[YOUR_DATASET].[YOUR_TABLE]`
WHERE
  (REGEXP_CONTAINS(transcript, '(?i) [YOUR_WORD]' ))
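
The sentimentscore, magnitude, and entities columns queried above are populated from the Cloud Natural Language API. For reference, here is a standalone sketch of the per-transcript calls using the google-cloud-language client library; it illustrates the API usage and is not code copied from the repo.

# nlp_call_sketch.py -- illustrative only
from google.cloud import language_v1


def annotate(text):
    """Returns sentiment and entities for a single piece of text."""
    client = language_v1.LanguageServiceClient()
    document = language_v1.Document(
        content=text, type_=language_v1.Document.Type.PLAIN_TEXT)

    sentiment = client.analyze_sentiment(
        request={"document": document}).document_sentiment
    entities = client.analyze_entities(
        request={"document": document}).entities

    return {
        "sentimentscore": sentiment.score,
        "magnitude": sentiment.magnitude,
        "entities": [
            {"name": e.name, "type": language_v1.Entity.Type(e.type_).name}
            for e in entities
        ],
    }


if __name__ == "__main__":
    print(annotate("I really enjoyed talking to the support agent today."))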
