Scrape all the pages and links of a given domain and write the results to Google Cloud BigQuery.
- Clone the repo locally
- Create a Google Cloud Platform project and enable the Compute Engine API and the BigQuery API
- Install the latest version of the gcloud SDK
- Authenticate with the gcloud SDK and set the active project to the one you created (see the example commands further down)
- Edit config.json.sample (see the sample configuration further down):
  - Update "domain" to match what you consider an "internal" domain pattern
  - Update "startUrl" to the entry point for the crawl
  - Update "projectId" to the GCP project ID
  - Update "bigQuery.datasetId" and "bigQuery.tableId" to the dataset ID and table ID you want the script to create and write the results to
- If you want to use a Redis instance (e.g. GCP Memorystore), set "redis.active" to true and update the host and port to match the instance
- Save config.json.sample as config.json and upload it to a Google Cloud Storage bucket (see the upload example further down)
- Edit gce-install.sh and update the `bucket` variable to the URL of the config file in Google Cloud Storage
- Once ready, run:
```shell
gcloud compute instances create web-scraper-gcp \
  --machine-type=n1-standard-16 \
  --metadata-from-file=startup-script=./gce-install.sh \
  --scopes=bigquery,cloud-platform \
  --zone=europe-north1-a
```
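For reference, enabling the APIs, authenticating, and setting the active project can all be done from the command line. This is just a sketch: `my-gcp-project` is a placeholder for your own project ID.

```shell
# Authenticate the gcloud SDK and point it at your project
# ("my-gcp-project" is a placeholder for your own project ID)
gcloud auth login
gcloud config set project my-gcp-project

# Enable the APIs the scraper needs
gcloud services enable compute.googleapis.com bigquery.googleapis.com
```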
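Here is a minimal sketch of what the edited config.json could look like; the actual config.json.sample in the repo may contain additional options not shown here.

```json
{
  "domain": "example.com",
  "startUrl": "https://www.example.com/",
  "projectId": "my-gcp-project",
  "bigQuery": {
    "datasetId": "web_scraper",
    "tableId": "crawl_results"
  },
  "redis": {
    "active": false,
    "host": "127.0.0.1",
    "port": 6379
  }
}
```

All values above are placeholders; the key names are the ones referenced in the steps above.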
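Uploading the finished config and pointing gce-install.sh at it might look roughly like this; the bucket name is a placeholder.

```shell
# Create a bucket (if you don't have one yet) and upload the config
# ("my-config-bucket" is a placeholder)
gsutil mb gs://my-config-bucket/
gsutil cp config.json gs://my-config-bucket/config.json
```

The `bucket` variable in gce-install.sh would then point at the uploaded file; check the script itself for the exact syntax it expects, but it should be along these lines:

```shell
bucket="gs://my-config-bucket/config.json"
```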
Feel free to change the `machine-type` in the create command above to something more or less powerful, and to use a different zone if you wish.
The command creates a new Compute Engine instance called "web-scraper-gcp", which runs the crawl as soon as the instance has started. Once the crawl is over, the instance is automatically stopped.
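To keep an eye on the crawl and inspect the results, something like the following should work. The dataset and table names below match the placeholder config above, so substitute your own.

```shell
# The instance stops itself when the crawl finishes, so its status tells you when it's done
gcloud compute instances list --filter="name=web-scraper-gcp"

# Count the rows the crawl wrote to BigQuery (dataset/table IDs from your config.json)
bq query --use_legacy_sql=false \
  'SELECT COUNT(*) AS crawled_urls FROM `my-gcp-project.web_scraper.crawl_results`'
```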