Extract Article from HTML

java-extractarticle extracts an article (the main content) from HTML.

Pre-requirements
Getting started
How to release

Pre-requirements

java-extractarticle is available on GitHub Packages. (Japanese version)

for Maven

Create a personal access token with read:packages permission at https://github.com/settings/tokens

Add your username and token to your ~/.m2/settings.xml file in a <server> tag.

<settings>
  <servers>
    <server>
      <id>github</id>
      <username>USERNAME</username>
      <password>YOUR_PERSONAL_ACCESS_TOKEN_WITH_READ</password>
    </server>
  </servers>
</settings>

Add a repository to the repositories section of your project's pom.xml file.

<repository>
  <id>github</id>
  <url>https://maven.pkg.github.com/koron/java-extractarticle</url>
</repository>

Add a <dependency> tag to your <dependencies> tag.

<dependency>
  <groupId>net.kaoriya</groupId>
  <artifactId>extractarticle</artifactId>
  <version>0.0.1</version>
</dependency>

Please also read the public documentation. (Japanese)

for Gradle

Create a personal access token with read:packages permission at https://github.com/settings/tokens

Add your username and token to your ~/.gradle/gradle.properties file.

gpr.user=YOUR_USERNAME
gpr.key=YOUR_PERSONAL_ACCESS_TOKEN_WITH_READ:PACKAGES

Add a repository to the repositories section of your build.gradle file.

maven {
    url = uri("https://maven.pkg.github.com/koron/java-extractarticle")
    credentials {
        username = project.findProperty("gpr.user") ?: System.getenv("USERNAME")
        password = project.findProperty("gpr.key") ?: System.getenv("TOKEN")
    }
}

Add an implementation line to your dependencies section.

implementation 'net.kaoriya:extractarticle:0.0.1'

Please also read the public documentation. (Japanese)

Getting Started

To be written.

import java.io.File;
import java.io.IOException;

import net.kaoriya.extractarticle.ArticleExtractor;

public class Example {
    public static void main(String[] args) throws IOException {
        // extract(File) is declared to throw IOException.
        var r = ArticleExtractor.extract(new File("./foobar.html"));

        // Print the article (main content). [String]
        System.out.println(r.text);

        // Print the description (meta[name='description']). [String]
        System.out.println(r.desc);

        // Print the score: the probability that the text is the main content.
        // [double: 0.0 to 1.0, or NaN]
        System.out.println(r.score);
    }
}

Score

The extracted article may be wrong, depending on the HTML content. To let you check whether the extraction succeeded, the extractor provides a score.

The score assumes that the description is derived from the article. In other words, it is the proportion of the description that can be found in the article, so it will be very close to 1.0 when the extraction went well.

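extract() occasionally fails by extracting something other than the main text. The score field takes advantage of the fact that the description often reuses text from the article: it scores the similarity between the description and the article, so you can see how well the extraction succeeded.

The score is calculated as follows. In other words, it shows how much of the description is contained in the article.

count({2-gram index of desc} ∩ {2-gram index of article})
---------------------------------------------------------
              count({2-gram index of desc})

If the score exceeds roughly 0.8 to 0.9, the article has almost certainly been extracted correctly.

This check does not work correctly in cases such as the following.

No description is set in the first place
The description contains text unrelated to the article

However, such content is rare, or, as with a blog that contains only photos, the body text has little value for analysis anyway, so this library provides no workaround.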
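The calculation above could be sketched roughly as follows. This is a minimal illustration and not the library's actual implementation: the class and method names (ScoreSketch, bigrams, score, looksReliable) are hypothetical, it assumes that the "2-gram index" is the set of distinct character 2-grams, and returning NaN for an empty description is also an assumption of this sketch.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the description/article 2-gram overlap score.
final class ScoreSketch {

    // Collect the distinct 2-character substrings (2-grams) of a string.
    static Set<String> bigrams(String s) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            grams.add(s.substring(i, i + 2));
        }
        return grams;
    }

    // Fraction of the description's 2-grams that also occur in the article.
    // Assumption: an empty description yields NaN, since there is nothing to compare.
    static double score(String desc, String article) {
        Set<String> descGrams = bigrams(desc);
        if (descGrams.isEmpty()) {
            return Double.NaN;
        }
        Set<String> common = new HashSet<>(descGrams);
        common.retainAll(bigrams(article));
        return (double) common.size() / descGrams.size();
    }

    // NaN compares false against any threshold, so a missing score is never accepted.
    static boolean looksReliable(double score, double threshold) {
        return score >= threshold;
    }

    public static void main(String[] args) {
        String article = "The quick brown fox jumps over the lazy dog.";
        double s = score("quick brown fox", article);
        System.out.println(s);
        System.out.println(looksReliable(s, 0.8));
    }
}
```

With a threshold in the 0.8 to 0.9 range mentioned above, a caller could use a check like looksReliable to decide whether to keep an extracted article. Because a NaN score fails the comparison, pages without a usable description are treated as unverified rather than as successes.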

`extract()` variations

ArticleExtractor.extract() has several overloads. The most basic form is extract(org.jsoup.nodes.Document doc).

The other overloads are:

extract(java.lang.String html)
extract(java.io.File file) throws java.io.IOException
extract(java.io.InputStream in) throws java.io.IOException

How to release

Update the version in build.gradle
./gradlew test
./gradlew publish

Set these properties to the correct values in ~/.gradle/gradle.properties
```
gpr.user=
gpr.key=
```
