Batch processing differs in tagging output from single document processing #1472
Comments
I will put this on my list to debug tomorrow
Well, I haven't fixed it, but I can confirm an issue with an even smaller test case. The length-50 file has an extra sentence split off.
Alright, my current theory is that the padding is what causes the differences. I'll see if there's some better way to mask it so that it doesn't actually care about how many pads there are
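For illustration only (this is not Stanza's actual code), a masking approach along those lines would zero out the padded positions of the encoder output so that downstream layers see identical values no matter how much padding the batch carries; the function name and shapes here are assumptions:

```python
import torch

def mask_padded_output(output, lengths):
    # Hypothetical sketch: output is (batch, max_len, hidden), lengths is a
    # (batch,) tensor of true sequence lengths on the same device as output.
    max_len = output.size(1)
    # mask[i, t] is True for real tokens and False for pad positions
    mask = torch.arange(max_len, device=output.device)[None, :] < lengths[:, None]
    # Zero out everything the model produced for pad positions
    return output * mask.unsqueeze(-1)
```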
Well, the good news is, it is definitely possible to make it stable. The bad news is, this requires either NestedTensor or PackedSequence. Unfortunately, the LSTM we use isn't compatible with NestedTensor, at least as of torch 2.5.1. That leaves PackedSequence, and that's about 20% slower when running the tokenizer. Very strange. Running a Pipeline of tokenize,pos,lemma,depparse is not that much slower, but it is still slower... still, it'll be more correct, I suppose
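A minimal sketch of the PackedSequence approach being described, assuming a batch-first PyTorch LSTM; the sizes and function name are illustrative rather than Stanza's actual code:

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

lstm = nn.LSTM(input_size=100, hidden_size=200, batch_first=True, bidirectional=True)

def run_lstm(padded_inputs, lengths):
    # padded_inputs: (batch, max_len, 100); lengths: list or CPU tensor of true lengths
    packed = pack_padded_sequence(padded_inputs, lengths, batch_first=True,
                                  enforce_sorted=False)
    packed_out, _ = lstm(packed)
    # Re-pad for downstream layers; the pad positions never went through the LSTM,
    # so the output no longer depends on how much padding the batch contains.
    out, _ = pad_packed_sequence(packed_out, batch_first=True)
    return out
```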
Hello! Thank you for the quick analysis. It's quite unfortunate that this causes a decrease in speed, but it should still be faster than single processing, correct?
PackedSequence sounds good. Surely it's not that much of a hit. It would probably make MPS support better too.
Sort in the other direction means we don't need to use enforce_sorted=False. Things are faster without the packed sequences, unfortunately, but they wind up having unstable results: #1472
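The commit message refers to sorting the batch by decreasing length so that pack_padded_sequence can keep its default enforce_sorted=True instead of re-sorting internally. A hedged sketch of that idea, with illustrative names:

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

def pack_descending(padded_inputs, lengths):
    # Sort the batch by decreasing length so the default enforce_sorted=True applies,
    # and keep the permutation so the original order can be restored afterwards.
    lengths, order = torch.sort(lengths, descending=True)
    packed = pack_padded_sequence(padded_inputs[order], lengths.cpu(), batch_first=True)
    return packed, order
```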
Unfortunately, NestedTensor doesn't support these operations, at least as of Torch 2.4. Perhaps one day.
Hello!
We have been using Stanza 1.10.1 with single document processing but want to switch to batch processing to increase speed. For that, we ran some benchmarks, among other things comparing the results of single processing and batch processing.
From what we can see, the results of processing a text individually are stable - we always get the same tagging result. If we process texts via batch processing, we get different results compared to processing the text individually.
We have tried
We're seeing differences in sentence splitting and tokenisation (especially end-of-sentence punctuation), which then lead to changes further down the pipeline (lemmatisation, dependency relationships).
Key code snippets
The comparison is done on the JSON representation of the Stanza document.
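A simplified sketch of the kind of comparison described above, assuming a German tokenize,pos,lemma,depparse pipeline and a plain list of raw texts (the helper name is illustrative, not our exact benchmark code):

```python
import stanza

nlp = stanza.Pipeline(lang="de", processors="tokenize,pos,lemma,depparse")

def count_differences(texts):
    # Single-document processing: one pipeline call per text.
    single = [nlp(text).to_dict() for text in texts]
    # Batch processing: pass a list of Documents through the pipeline in one call.
    batch = [doc.to_dict() for doc in nlp([stanza.Document([], text=t) for t in texts])]
    # Compare the JSON-style representations and count the texts that differ.
    return sum(1 for s, b in zip(single, batch) if s != b)
```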
Attached is the test data we used. We find 13 documents with differences out of 1,000 tagged documents overall. Is there any way to mitigate this issue?
wiki_de.csv