IceBERT PoS Tagging Interface

A high-level Python interface for PoS tagging Icelandic text using the IceBERT-PoS model with classical tokenization.

TODOs

  • Add license information
  • Proper device handling (GPU) for tensors

Installation

# This package is currently not available on PyPI, so you need to install it directly from the source repository.

# Without PyTorch (lighter, but model inference won't work)
pip install git+ssh://[email protected]/mideind/IceBERT-PoS.git

# With PyTorch support (required for model inference) - RECOMMENDED
pip install "git+ssh://[email protected]/mideind/IceBERT-PoS.git[torch]"

Note: The [torch] extra is required for model inference, as PyTorch models need PyTorch to run. The default installation is only useful for development work that doesn't involve running the actual models.

Features

  • Classical Tokenization: Uses the Miðeind tokenizer for Icelandic tokenization
  • Character Positions: Preserves exact character start/end positions in original text
  • Sentence-Aware Processing: Maintains sentence boundaries and processes them in batches
  • Dual Format Output: Provides both IFD tags and structured category/features
  • Caller-owned Model: Load model once, reuse for multiple calls
  • Batch Processing: Efficient processing of multiple sentences

Usage

Command Line Interface

After installation, you can use the icebert-pos command:

# Basic POS tagging with full IFD tags
icebert-pos "Þetta er stutt sýnidæmi."
# Þetta[fahen] er[sfg3en] stutt[lhensf] sýnidæmi[nhen].[pl]

# Get only POS categories (without detailed features)
icebert-pos --only-category "Þetta er stutt sýnidæmi."
# Þetta[fa] er[sf] stutt[l] sýnidæmi[n].[pl]

# Get structured json output
icebert-pos --json "Þetta er stutt sýnidæmi."
# [
#   [
#     {
#       "text": "Þetta",
#       "char_start": 0,
#       "char_end": 5,
#       "category": "fa",
#       "features": [
#         "neut",
#         "sing",
#         "nom"
#       ],
#       "ifd_tag": "fahen"
#     },
#     ...
#     {
#       "text": ".",
#       "char_start": 23,
#       "char_end": 24,
#       "category": "pl",
#       "features": [],
#       "ifd_tag": "pl"
#     }
#   ]
# ]

# Default behavior is to split composite tokens (like "samskipta- og kynningarstýra") into individual tokens
icebert-pos "samskipta- og kynningarstýra"
# 3 tokens:
# samskipta-[kt] og[c] kynningarstýra[nven]
icebert-pos --keep-composite-tokens "samskipta- og kynningarstýra"
# 1 token:
# samskipta- og kynningarstýra[nven]

# Enable debug logging
icebert-pos --debug "Þetta er stutt sýnidæmi."
# lots of output

Additional command line options are available; run icebert-pos --help to see them.

Python API

Simple Usage

from icebert_pos import pos_tag_text, TaggedToken
from transformers import AutoModel, AutoTokenizer
import torch

# Load the model and tokenizer. trust_remote_code=True is required to load the custom model code.
# You can check the model repository for details: https://huggingface.co/mideind/IceBERT-PoS
model = AutoModel.from_pretrained("mideind/IceBERT-PoS", trust_remote_code=True)
# set the model to evaluation mode - otherwise the output will be stochastic
model.eval()
# place the model on the appropriate device (CPU/GPU)
model.to("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("mideind/IceBERT-PoS")

text = "Þetta er stutt sýnidæmi."
# POS tag text - returns List[List[TaggedToken]]
sentence_results = pos_tag_text(text, model, tokenizer)
assert sentence_results == [
    [
        TaggedToken(text="Þetta", char_start=0, char_end=5, category="fa", features=["neut", "sing", "nom"], ifd_tag="fahen"),
        TaggedToken(text="er", char_start=6, char_end=8, category="sf", features=["sing", "act", "3", "pres"], ifd_tag="sfg3en"),
        TaggedToken(text="stutt", char_start=9, char_end=14, category="l", features=["neut", "sing", "nom", "strong", "pos"], ifd_tag="lhensf"),
        TaggedToken(text="sýnidæmi", char_start=15, char_end=23, category="n", features=["neut", "sing", "nom"], ifd_tag="nhen"),
        TaggedToken(text=".", char_start=23, char_end=24, category="pl", features=[], ifd_tag="pl")
    ]
]
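Since each TaggedToken carries exact character offsets, you can map tokens back onto the original string. A small sketch reusing the setup above (the print format is just illustrative):

# Each token's char_start/char_end span slices the original text exactly
for sentence in sentence_results:
    for token in sentence:
        assert text[token.char_start:token.char_end] == token.text
        print(f"{token.text}\t{token.category}\t{token.ifd_tag}")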

Batch Processing for Efficiency

# For processing multiple sentences efficiently
# The Miðeind tokenizer will split this string into 3 sentences
texts = ["Fyrsti texti.", "Annar texti.", "Þriðji texti."]
# pos_tag_text batches automatically; with batch_size=2 and 3 sentences, model.forward is called twice
sentence_results = pos_tag_text("\n".join(texts), model, tokenizer, batch_size=2)
assert len(sentence_results) == 3  # Should return 3 sentences
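The same call scales to longer documents. A sketch (the file name is hypothetical) combining a larger batch size with truncation, so over-long sentences are cut to the model's maximum length instead of raising errors:

# Hypothetical input file; batch_size and truncate are documented parameters of pos_tag_text
with open("document.txt", encoding="utf-8") as f:
    long_text = f.read()
results = pos_tag_text(long_text, model, tokenizer, batch_size=16, truncate=True)
print(f"Tagged {len(results)} sentences.")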

Advanced Usage with Lower-Level Functions

from icebert_pos import (
    segment_text_to_sentences,
    prepare_sentence,
    batch_sentences,
    predict_sentences
)
# Same example as before
text = "Þetta er stutt sýnidæmi."
# Segment text into sentences
sentences = segment_text_to_sentences(text)

# Prepare individual sentences
sentence_tensors = []
for sentence in sentences:
    tensors = prepare_sentence(sentence, model, tokenizer, truncate=True)
    sentence_tensors.append(tensors)

# Batch multiple sentences for efficient processing
batch_input_ids, batch_attention_mask, batch_word_mask = batch_sentences(
    sentence_tensors, tokenizer
)

# Get raw predictions
predictions = predict_sentences(
    batch_input_ids, batch_attention_mask, batch_word_mask, model
)

# predictions is List[List[Tuple[str, List[str]]]]
# - List of sentences
# - Each sentence has List of (category, features) tuples for each word
assert predictions == [
    [
        ("fa", ["neut", "sing", "nom"]),
        ("sf", ["sing", "act", "3", "pres"]),
        ("l", ["neut", "sing", "nom", "strong", "pos"]),
        ("n", ["neut", "sing", "nom"]),
        ("pl", [])
    ]
]
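To pair these raw predictions back with their tokens, you can zip them against the segmented sentences. A sketch, assuming each Sentence exposes its tokens list as described under Data Structures below:

# Walk sentences and their per-word (category, features) predictions in parallel
for sentence, sentence_preds in zip(sentences, predictions):
    for token, (category, features) in zip(sentence.tokens, sentence_preds):
        print(token.text, category, features)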

Data Structures

Token

Basic token with text and position:

  • text: The token text
  • char_start: Start position in original text
  • char_end: End position in original text

Sentence

Collection of tokens representing a sentence:

  • tokens: List of Token objects

TaggedToken

Token with POS tagging information (extends Token):

  • text: The token text
  • char_start: Start position in original text
  • char_end: End position in original text
  • category: POS category (e.g., "fp", "sf")
  • features: List of morphological features (e.g., ["1", "sing", "nom"])
  • ifd_tag: Full IFD POS tag (e.g., "fp1en", "sfg3en")
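For orientation, a minimal sketch of these containers (the real definitions ship with icebert_pos; the dataclass form here is an assumption):

from dataclasses import dataclass
from typing import List

@dataclass
class Token:
    text: str        # the token text
    char_start: int  # start position in the original text
    char_end: int    # end position in the original text

@dataclass
class Sentence:
    tokens: List[Token]  # the tokens making up the sentence

@dataclass
class TaggedToken(Token):
    category: str        # POS category, e.g. "fa"
    features: List[str]  # morphological features, e.g. ["neut", "sing", "nom"]
    ifd_tag: str         # full IFD tag, e.g. "fahen"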

API Reference

High-Level Functions

  • pos_tag_text(text, model, tokenizer, batch_size=1, split_composite_tokens=True, truncate=False) - Main function for POS tagging
  • segment_text_to_sentences(text, split_composite_tokens=True) - Segment text into sentences using classical tokenization

Parameters

  • batch_size: Number of sentences to process in each batch for efficiency (default: 1)
  • split_composite_tokens: Whether to split composite tokens (like "samskipta- og kynningarstýra") into individual tokens on whitespace (default: True)
  • truncate: Whether to truncate input sequences that exceed the model's maximum length. If False, long sentences may cause errors (default: False)
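The split_composite_tokens parameter is the Python counterpart of the CLI's --keep-composite-tokens flag. A sketch mirroring the CLI example above:

# Keep "samskipta- og kynningarstýra" as a single composite token
results = pos_tag_text("samskipta- og kynningarstýra", model, tokenizer,
                       split_composite_tokens=False)
assert len(results[0]) == 1  # one token instead of three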

Lower-Level Functions

  • prepare_sentence(sentence, model, tokenizer, truncate=False) - Prepare tensors for a single sentence
  • batch_sentences(sentence_tensors, tokenizer) - Batch multiple sentence tensors
  • predict_sentences(input_ids, attention_mask, word_mask, model) - Get raw predictions from model

The lower-level functions give you more control over the processing, but you also need to handle device placement and batching manually, as in the sketch below.
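A sketch of manual device placement for this path, assuming the batched tensors are standard PyTorch tensors returned on the CPU:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
# Move the batched tensors to the same device as the model before predicting
batch_input_ids = batch_input_ids.to(device)
batch_attention_mask = batch_attention_mask.to(device)
batch_word_mask = batch_word_mask.to(device)
predictions = predict_sentences(
    batch_input_ids, batch_attention_mask, batch_word_mask, model
)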

License

TODO: Add license information here.

Copyright (C) Miðeind ehf.
