Batch processing differs in tagging output from single document processing #1472


Open
DZNLP opened this issue Apr 2, 2025 · 8 comments

@DZNLP

DZNLP commented Apr 2, 2025

Hello!

We have been using Stanza 1.10.1 with single document processing but want to switch to batch processing to increase speed. As part of that, we ran some benchmarks, which among other things compare the results of single processing and batch processing.

From what we can see, the results of processing a text individually are stable - we always get the same tagging result. If we process texts via batch processing, we get different results compared to processing the texts individually.

We have tried

  • different batch sizes (25, 50, 100, 500)
  • different types of data (user generated content, Wikipedia text)
  • different models (DE, FR, KO)
  • checked for multiple line breaks in our data to avoid issues with the concatenation of documents in the tokeniser

We're seeing differences in sentence splitting and tokenisation (especially around end-of-sentence punctuation), which then lead to changes further down the pipeline (lemmatisation, dependency relations).

Key code snippets

self.nlp = stanza.Pipeline(lang = 'de',
                           processors = 'tokenize,mwt,pos,lemma,depparse',
                           package = 'gsd')

def get_docs(self, texts):
    documents = [Document([], text=doc_content) for doc_content in texts]
    return self.nlp(documents)

def get_documents(model, data, lang):
    for batch in data:
        keys, text = map(list, zip(*batch))
        result_documents = model.get_docs(text)

The comparison is done on the JSON representation of the Stanza document.
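
Roughly, the comparison looks like this (a simplified sketch, not our exact benchmark code; it assumes `nlp` is the pipeline configured as above and `texts` is the list of raw strings):

single = [nlp(text) for text in texts]                            # one text at a time
batched = nlp([Document([], text=text) for text in texts])        # whole list in one call

diffs = [i for i, (s, b) in enumerate(zip(single, batched))
         if s.to_dict() != b.to_dict()]
print("%d of %d documents differ: %s" % (len(diffs), len(texts), diffs))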

Attached is the test data we used. We find differences in 13 out of 1000 tagged documents overall. Is there any way to mitigate this issue?

wiki_de.csv

@DZNLP DZNLP added the bug label Apr 2, 2025
@AngledLuffa
Collaborator

I will put this on my list to debug tomorrow

@AngledLuffa
Collaborator

AngledLuffa commented Apr 4, 2025

Well, I haven't fixed it, but I can confirm the issue with an even smaller test case. The output file for batch size 50 has an extra sentence split off.
Will continue debugging tomorrow... I have some theories, at least

import stanza
from stanza.models.common.doc import Document
from stanza.utils.conll import CoNLL

from tqdm import tqdm

#nlp = stanza.Pipeline(lang = 'de',
#                      processors = 'tokenize,mwt,pos,lemma,depparse',
#                      package = 'gsd')

nlp = stanza.Pipeline(lang = 'de',
                      processors = 'tokenize',
                      package = 'gsd')

def get_docs(nlp, data):
    documents = [Document([], text=doc_content) for doc_content in data]
    return nlp(documents)

def get_documents(nlp, data, batch_size):
    docs = []
    for start_text in tqdm(range(0, len(data), batch_size)):
        end_text = start_text + batch_size
        docs.extend(get_docs(nlp, data[start_text:end_text]))
    return docs

with open("wiki_de.csv") as fin:
    lines = fin.readlines()
    lines = lines[1:]

#data = [line.strip().split("\t", maxsplit=1)[1][1:-1] for line in lines]
data = lines

for length in (25, 33, 50, 100):
    docs = get_documents(nlp, data, length)

    sent_id = 0
    for doc in docs:
        for sentence in doc.sentences:
            sentence.sent_id = sent_id
            sent_id += 1
        CoNLL.write_doc2conll(doc, "wiki_de_%d.conllu" % length, mode='a')

@AngledLuffa
Collaborator

Alright, my current theory is:

  • I isolated which sentence is changing based on batch size of 25 or 50
  • In one batch size, that sentence is the longest sentence in its batch. In the other batch size, it is part of a batch with longer sentences
  • When it's the longest sentence, the model only has one "pad" character to the right of the sentence. Otherwise, there are quite a few
  • The model is a bidirectional LSTM over the characters, meaning that when going from right to left, it sees a different number of pads and therefore gets slightly different results

I'll see if there's some better way to mask it so that it doesn't actually care about how many pads there are
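
Here's a toy illustration of that theory - not the actual tokenizer model, just a plain bidirectional LSTM run over the same sequence with different amounts of right padding:

import torch
import torch.nn as nn

torch.manual_seed(0)
# stand-in for the character model: a small bidirectional LSTM
lstm = nn.LSTM(input_size=8, hidden_size=16, bidirectional=True, batch_first=True)

seq = torch.randn(1, 5, 8)                                   # one "sentence" of 5 characters
one_pad = torch.cat([seq, torch.zeros(1, 1, 8)], dim=1)      # longest in its batch: 1 pad
many_pads = torch.cat([seq, torch.zeros(1, 10, 8)], dim=1)   # shorter than batch max: 10 pads

out_one, _ = lstm(one_pad)
out_many, _ = lstm(many_pads)

# the backward direction reads the pads before the real characters,
# so the outputs over the 5 real positions are no longer identical
print(torch.allclose(out_one[:, :5], out_many[:, :5]))       # False (small numeric differences)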

@AngledLuffa
Collaborator

Well, the good news is, it is definitely possible to make it stable. The bad news is, this requires either NestedTensor or PackedSequence. Unfortunately, the LSTM we use isn't compatible with NestedTensor, at least as of torch 2.5.1. That leaves PackedSequence, and that's about 20% slower when running the tokenizer. Very strange. Running a Pipeline of tokenize,pos,lemma,depparse is not that much slower, but it is still slower...

still, it'll be more correct, I suppose
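
Same toy setup as above, but with PackedSequence - a rough sketch of the approach, not the actual Stanza change. Packing records the true length, so the LSTM never reads the pads:

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
lstm = nn.LSTM(input_size=8, hidden_size=16, bidirectional=True, batch_first=True)

seq = torch.randn(1, 5, 8)
one_pad = torch.cat([seq, torch.zeros(1, 1, 8)], dim=1)
many_pads = torch.cat([seq, torch.zeros(1, 10, 8)], dim=1)

# pack with the true length (5) so only the real positions are processed
packed_one = pack_padded_sequence(one_pad, lengths=[5], batch_first=True, enforce_sorted=False)
packed_many = pack_padded_sequence(many_pads, lengths=[5], batch_first=True, enforce_sorted=False)

out_one, _ = pad_packed_sequence(lstm(packed_one)[0], batch_first=True)
out_many, _ = pad_packed_sequence(lstm(packed_many)[0], batch_first=True)

print(torch.allclose(out_one, out_many))   # True - results no longer depend on the padding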

@DZNLP
Author

DZNLP commented Apr 7, 2025

Hello!

Thank you for the quick analysis. It's quite unfortunate that this causes a decrease in speed, but it should still be faster than single processing, correct?

@AngledLuffa
Collaborator

Conferred with my PI (@manning) and we were thinking, maybe just make PackedSequence the default behavior so that the results are always repeatable. Just take the hit on the efficiency...

@Jemoka any objections?

@Jemoka
Member

Jemoka commented Apr 11, 2025

PackedSequence sounds good. Surely it's not that much of a hit. Would probably make MPS support better too.

AngledLuffa added a commit that referenced this issue Apr 12, 2025
Sort in the other direction means we don't need to use enforce_sorted=False

Things are faster without the packed sequences, unfortunately, but they wind up having unstable results:

#1472
@AngledLuffa
Collaborator

Unfortunately, NestedTensor doesn't support these operations, at least as of Torch 2.4. Perhaps one day.
