Site Metadata Extractor

Cleans and extracts the metadata of a web page or other website resource.

Metadata extraction fields currently supported:

Name	Data Type
author	array (jsonb)
canonical_url	string
copyright	string
date (publish date)	date
description	text
favicon	text
image (primary/og image)	text
jsonld (structured data)	object (jsonb)
keywords	array (jsonb)
lang	string
locale	string
origin	string
publisher	string
site_name	string
tags	array (jsonb)
title	string
type	string
truncated_text	text
status	string
videos	array (jsonb)
links	array (jsonb)

Install

NPM:

$ npm install site-metadata-extractor --save

Yarn:

$ yarn add site-metadata-extractor

Usage

Feed in the raw markup of a web page to get the extracted metadata fields.

From .html file:

import fs from "fs";
import path from "path"; // needed for path.resolve below
import siteMetadataExtractor from "site-metadata-extractor";

const getMetadataFromFile = (filename) => {
  const filepath = path.resolve(__dirname, `../data/${filename}.html`);
  const markup = fs.readFileSync(filepath).toString();
  // feel free to use localhost as the second parameter for testing
  const metadata = siteMetadataExtractor(markup, "YOUR_SITE_ORIGIN_HERE");
  return metadata;
};

getMetadataFromFile("example");
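
The second argument in the example above is a placeholder for the site origin. As a minimal sketch, assuming you start from a full page URL, the origin can be derived with the standard URL class that Node.js exposes globally. The helper name originOf is hypothetical and is not part of site-metadata-extractor.

```javascript
// Hypothetical helper (not part of site-metadata-extractor): derive the
// origin (scheme + host + optional port) from a full page URL.
const originOf = (pageUrl) => {
  try {
    return new URL(pageUrl).origin;
  } catch (err) {
    // new URL() throws a TypeError on strings it cannot parse
    return null;
  }
};

console.log(originOf("https://www.cnbc.com/guide/personal-finance-101-the-complete-guide-to-managing-your-money/"));
console.log(originOf("http://localhost:3000/example.html"));
```

The result could then be passed as the second argument in place of "YOUR_SITE_ORIGIN_HERE". A null return signals that the input was not a parseable absolute URL.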

From a server request:

import axios from 'axios';
import siteMetadataExtractor from 'site-metadata-extractor';

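const processSite = async (url) => {
  try {
    const res = await axios.get(url);
    // the header may be missing, so fall back to an empty string
    const contentType = res.headers['content-type'] || '';
    // only HTML responses carry markup the extractor can parse
    if (contentType.includes('text/html')) {
      return { body: res.data, url };
    }
    return {};
  } catch (err) {
    console.log(err);
    // return an empty object so callers can still read data.body safely
    return {};
  }
};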

const pageUrl = 'https://www.cnbc.com/guide/personal-finance-101-the-complete-guide-to-managing-your-money/';

processSite(pageUrl)
  .then((data) => {
    // ...
    // pass the markup (data.body), not the whole result object
    siteMetadataExtractor(data.body, pageUrl, "en");
    // ...
  });
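
The request example filters responses by their content-type header before extracting metadata. A stricter version of that check might look like the sketch below. It assumes lower-cased header names, as in the snippet above, and the helper name isHtmlResponse is hypothetical rather than part of site-metadata-extractor or axios.

```javascript
// Hypothetical helper: decide whether response headers describe an HTML document.
const isHtmlResponse = (headers = {}) => {
  const contentType = headers["content-type"];
  // a missing or non-string header cannot be treated as HTML
  if (typeof contentType !== "string") return false;
  // drop parameters such as "; charset=utf-8" and compare the media type only
  const mediaType = contentType.split(";")[0].trim().toLowerCase();
  return mediaType === "text/html";
};

console.log(isHtmlResponse({ "content-type": "text/html; charset=utf-8" }));
console.log(isHtmlResponse({ "content-type": "application/json" }));
```

Comparing the parsed media type instead of using a substring match avoids accepting unrelated values that merely contain the text "text/html", and the type check guards against responses that omit the header.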

Development

Run: git clone https://github.com/sc10ntech/site-metadata-extractor.git
Change into project directory and install deps: cd site-metadata-extractor && npm i

Credits & Disclaimer

site-metadata-extractor was inspired by, and tries to be the spiritual successor to, node-unfluff.
