IceBERT PoS Tagging Interface

A high-level Python interface for PoS tagging Icelandic text using the IceBERT-PoS model with classical tokenization.

TODOs

  • Add license information
  • Proper device handling (GPU) for tensors

Installation

# This package is currently not available on PyPI, so you need to install it directly from the source repository.

# Without PyTorch (lighter, but model inference won't work)
pip install git+ssh://[email protected]/mideind/IceBERT-PoS.git

# With PyTorch support (required for model inference) - RECOMMENDED
pip install "git+ssh://[email protected]/mideind/IceBERT-PoS.git[torch]"

Note: The [torch] extra is required for model inference, as PyTorch models need PyTorch to run. The default installation is only useful for development work that doesn't involve running the actual models.

Features

  • Classical Tokenization: Uses the Miðeind tokenizer for Icelandic tokenization
  • Character Positions: Preserves exact character start/end positions in original text
  • Sentence-Aware Processing: Maintains sentence boundaries and processes them in batches
  • Dual Format Output: Provides both IFD tags and structured category/features
  • Caller-owned Model: Load model once, reuse for multiple calls
  • Batch Processing: Efficient processing of multiple sentences

Usage

Command Line Interface

After installation, you can use the icebert-pos command:

# Basic POS tagging with full IFD tags
icebert-pos "Þetta er stutt sýnidæmi."
# Þetta[fahen] er[sfg3en] stutt[lhensf] sýnidæmi[nhen].[pl]

# Get only POS categories (without detailed features)
icebert-pos --only-category "Þetta er stutt sýnidæmi."
# Þetta[fa] er[sf] stutt[l] sýnidæmi[n].[pl]

# Get structured json output
icebert-pos --json "Þetta er stutt sýnidæmi."
# [
#   [
#     {
#       "text": "Þetta",
#       "char_start": 0,
#       "char_end": 5,
#       "category": "fa",
#       "features": [
#         "neut",
#         "sing",
#         "nom"
#       ],
#       "ifd_tag": "fahen"
#     },
#     ...
#     {
#       "text": ".",
#       "char_start": 23,
#       "char_end": 24,
#       "category": "pl",
#       "features": [],
#       "ifd_tag": "pl"
#     }
#   ]
# ]

# Default behavior is to split composite tokens (like "samskipta- og kynningarstýra") into individual tokens
icebert-pos "samskipta- og kynningarstýra"
# 3 tokens:
# samskipta-[kt] og[c] kynningarstýra[nven]
icebert-pos --keep-composite-tokens "samskipta- og kynningarstýra"
# 1 token:
# samskipta- og kynningarstýra[nven]

# Enable debug logging
icebert-pos --debug "Þetta er stutt sýnidæmi."
# lots of output

Additional command line options are available; run icebert-pos --help to see them.

Python API

Simple Usage

from icebert_pos import pos_tag_text, TaggedToken
from transformers import AutoModel, AutoTokenizer
import torch

# Load the model and tokenizer. trust_remote_code=True is required to load the custom model code.
# You can check the model repository for details: https://huggingface.co/mideind/IceBERT-PoS
model = AutoModel.from_pretrained("mideind/IceBERT-PoS", trust_remote_code=True)
# set the model to evaluation mode - otherwise the output will be stochastic
model.eval()
# place the model on the appropriate device (CPU/GPU)
model.to("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("mideind/IceBERT-PoS")

text = "Þetta er stutt sýnidæmi."
# POS tag text - returns List[List[TaggedToken]]
sentence_results = pos_tag_text(text, model, tokenizer)
assert sentence_results == [
    [
        TaggedToken(text="Þetta", char_start=0, char_end=5, category="fa", features=["neut", "sing", "nom"], ifd_tag="fahen"),
        TaggedToken(text="er", char_start=6, char_end=8, category="sf", features=["sing", "act", "3", "pres"], ifd_tag="sfg3en"),
        TaggedToken(text="stutt", char_start=9, char_end=14, category="l", features=["neut", "sing", "nom", "strong", "pos"], ifd_tag="lhensf"),
        TaggedToken(text="sýnidæmi", char_start=15, char_end=23, category="n", features=["neut", "sing", "nom"], ifd_tag="nhen"),
        TaggedToken(text=".", char_start=23, char_end=24, category="pl", features=[], ifd_tag="pl")
    ]
]
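Since each TaggedToken carries exact character offsets, you can map tokens back onto the original string. A small sketch reusing the setup above (the print format is just illustrative):

# Each token's char_start/char_end span slices the original text exactly
for sentence in sentence_results:
    for token in sentence:
        assert text[token.char_start:token.char_end] == token.text
        print(f"{token.text}\t{token.category}\t{token.ifd_tag}")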

Batch Processing for Efficiency

# For processing multiple sentences efficiently
# The Miðeind tokenizer will split this string into 3 sentences
texts = ["Fyrsti texti.", "Annar texti.", "Þriðji texti."]
# pos_tag_text batches automatically; with batch_size=2 and 3 sentences, model.forward is called twice
sentence_results = pos_tag_text("\n".join(texts), model, tokenizer, batch_size=2)
assert len(sentence_results) == 3  # Should return 3 sentences
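The same call scales to longer documents. A sketch (the file name is hypothetical) combining a larger batch size with truncation, so over-long sentences are cut to the model's maximum length instead of raising errors:

# Hypothetical input file; batch_size and truncate are documented parameters of pos_tag_text
with open("document.txt", encoding="utf-8") as f:
    long_text = f.read()
results = pos_tag_text(long_text, model, tokenizer, batch_size=16, truncate=True)
print(f"Tagged {len(results)} sentences.")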

Advanced Usage with Lower-Level Functions

from icebert_pos import (
    segment_text_to_sentences,
    prepare_sentence,
    batch_sentences,
    predict_sentences
)
# Same example as before
text = "Þetta er stutt sýnidæmi."
# Segment text into sentences
sentences = segment_text_to_sentences(text)

# Prepare individual sentences
sentence_tensors = []
for sentence in sentences:
    tensors = prepare_sentence(sentence, model, tokenizer, truncate=True)
    sentence_tensors.append(tensors)

# Batch multiple sentences for efficient processing
batch_input_ids, batch_attention_mask, batch_word_mask = batch_sentences(
    sentence_tensors, tokenizer
)

# Get raw predictions
predictions = predict_sentences(
    batch_input_ids, batch_attention_mask, batch_word_mask, model
)

# predictions is List[List[Tuple[str, List[str]]]]
# - List of sentences
# - Each sentence has List of (category, features) tuples for each word
assert predictions == [
    [
        ("fa", ["neut", "sing", "nom"]),
        ("sf", ["sing", "act", "3", "pres"]),
        ("l", ["neut", "sing", "nom", "strong", "pos"]),
        ("n", ["neut", "sing", "nom"]),
        ("pl", [])
    ]
]
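To pair these raw predictions back with their tokens, you can zip them against the segmented sentences. A sketch, assuming each Sentence exposes its tokens list as described under Data Structures below:

# Walk sentences and their per-word (category, features) predictions in parallel
for sentence, sentence_preds in zip(sentences, predictions):
    for token, (category, features) in zip(sentence.tokens, sentence_preds):
        print(token.text, category, features)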

Data Structures

Token

Basic token with text and position:

  • text: The token text
  • char_start: Start position in original text
  • char_end: End position in original text

Sentence

Collection of tokens representing a sentence:

  • tokens: List of Token objects

TaggedToken

Token with POS tagging information (extends Token):

  • text: The token text
  • char_start: Start position in original text
  • char_end: End position in original text
  • category: POS category (e.g., "fp", "sf")
  • features: List of morphological features (e.g., ["1", "sing", "nom"])
  • ifd_tag: Full IFD POS tag (e.g., "fp1en", "sfg3en")
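For orientation, a minimal sketch of these containers (the real definitions ship with icebert_pos; the dataclass form here is an assumption):

from dataclasses import dataclass
from typing import List

@dataclass
class Token:
    text: str        # the token text
    char_start: int  # start position in the original text
    char_end: int    # end position in the original text

@dataclass
class Sentence:
    tokens: List[Token]  # the tokens making up the sentence

@dataclass
class TaggedToken(Token):
    category: str        # POS category, e.g. "fa"
    features: List[str]  # morphological features, e.g. ["neut", "sing", "nom"]
    ifd_tag: str         # full IFD tag, e.g. "fahen"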

API Reference

High-Level Functions

  • pos_tag_text(text, model, tokenizer, batch_size=1, split_composite_tokens=True, truncate=False) - Main function for POS tagging
  • segment_text_to_sentences(text, split_composite_tokens=True) - Segment text into sentences using classical tokenization

Parameters

  • batch_size: Number of sentences to process in each batch for efficiency (default: 1)
  • split_composite_tokens: Whether to split composite tokens (like "samskipta- og kynningarstýra") into individual tokens on whitespace (default: True)
  • truncate: Whether to truncate input sequences that exceed the model's maximum length. If False, long sentences may cause errors (default: False)
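The split_composite_tokens parameter is the Python counterpart of the CLI's --keep-composite-tokens flag. A sketch mirroring the CLI example above:

# Keep "samskipta- og kynningarstýra" as a single composite token
results = pos_tag_text("samskipta- og kynningarstýra", model, tokenizer,
                       split_composite_tokens=False)
assert len(results[0]) == 1  # one token instead of three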

Lower-Level Functions

  • prepare_sentence(sentence, model, tokenizer, truncate=False) - Prepare tensors for a single sentence
  • batch_sentences(sentence_tensors, tokenizer) - Batch multiple sentence tensors
  • predict_sentences(input_ids, attention_mask, word_mask, model) - Get raw predictions from model

The lower-level functions give you more control over the processing, but you also need to handle device placement and batching manually, as in the sketch below.
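A sketch of manual device placement for this path, assuming the batched tensors are standard PyTorch tensors returned on the CPU:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
# Move the batched tensors to the same device as the model before predicting
batch_input_ids = batch_input_ids.to(device)
batch_attention_mask = batch_attention_mask.to(device)
batch_word_mask = batch_word_mask.to(device)
predictions = predict_sentences(
    batch_input_ids, batch_attention_mask, batch_word_mask, model
)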

License

TODO: Add license information here.

Copyright (C) Miðeind ehf.
