@shehadak commented Nov 8, 2023

This PR builds on the Huggingface subject, which assumes that models are autoregressive (following the ModelForCausalLM interface). It adds support for bidirectional models with masked language modeling (following the ModelForMaskedLM interface). Since bidirectional models rely on future context, I use a sliding window approach (see google-research/bert#66): for each text part, up to w/2 tokens are included for the previous context plus the current part, and the remaining w/2 tokens are masked.
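
For illustration, here is a minimal sketch of the sliding-window idea using plain `transformers`/`torch` rather than the code in this PR; the window size `w` and the helper name are illustrative assumptions, not the actual implementation.

```python
# Illustrative sketch only: the window size and helper name are assumptions,
# not this PR's actual implementation.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForMaskedLM.from_pretrained('bert-base-uncased')

def predict_next_word(context: str, w: int = 64) -> str:
    """Predict the next word from `context` with a bidirectional model:
    keep up to w/2 real tokens, mask the remaining 'future' positions."""
    context_ids = tokenizer(context, add_special_tokens=False)['input_ids']
    keep = context_ids[-(w // 2):]                        # previous context + current part
    masks = [tokenizer.mask_token_id] * (w - len(keep))   # masked future positions
    input_ids = torch.tensor([keep + masks])
    with torch.no_grad():
        logits = model(input_ids).logits                  # [1, w, vocab_size]
    next_token = logits[0, len(keep)].argmax().item()     # prediction at the first masked slot
    return tokenizer.decode([next_token]).strip()

print(predict_next_word('the quick brown fox jumps over the'))
```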

The region_layer_mapping for the language system was determined by scoring every transformer layer in BERT's encoder against the Pereira2018.243sentences-linear, Pereira2018.384sentences-linear, and Blank2014-linear benchmarks, and choosing the layer with the highest average score.
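
For reference, that selection procedure roughly follows the sketch below. It assumes `brainscore_language.load_benchmark` and that a benchmark can be called directly on an `ArtificialSubject`; the exact calls may differ from the scripts actually used.

```python
# Sketch of the layer-selection loop; exact brainscore_language calls are assumptions.
from brainscore_language import load_benchmark
from brainscore_language.artificial_subject import ArtificialSubject
from brainscore_language.model_helpers.huggingface import HuggingfaceSubject

benchmarks = [load_benchmark(identifier) for identifier in
              ['Pereira2018.243sentences-linear', 'Pereira2018.384sentences-linear', 'Blank2014-linear']]
candidate_layers = [f'bert.encoder.layer.{i}' for i in range(12)]  # all 12 encoder layers in bert-base-uncased

scores = {}
for layer in candidate_layers:
    subject = HuggingfaceSubject(
        model_id='bert-base-uncased', bidirectional=True,
        region_layer_mapping={ArtificialSubject.RecordingTarget.language_system: layer})
    layer_scores = [float(benchmark(subject)) for benchmark in benchmarks]
    scores[layer] = sum(layer_scores) / len(layer_scores)

best_layer = max(scores, key=scores.get)  # layer with the highest average score
```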

This PR also provides unit tests for reading time estimation, next word prediction, and neural recording, using the bert-base-uncased model. Future models can use the same format, as long as they implement the ModelForMaskedLM interface. For example, to add the base DistilBERT model:

```python
model_registry['distilbert-base-uncased'] = lambda: HuggingfaceSubject(
    model_id='distilbert-base-uncased',
    region_layer_mapping={ArtificialSubject.RecordingTarget.language_system: 'distilbert.transformer.layer.5'},
    bidirectional=True)
```
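
As a rough illustration of the test format mentioned above, a next-word test for the bidirectional path might look like the sketch below. It assumes the existing `ArtificialSubject` behavioral-task API (`start_behavioral_task` / `digest_text`), and the assertion is deliberately loose.

```python
# Sketch of a unit test for the bidirectional path; assumes the existing
# ArtificialSubject task API and only checks that a prediction is produced.
from brainscore_language.artificial_subject import ArtificialSubject
from brainscore_language.model_helpers.huggingface import HuggingfaceSubject

def test_next_word_bidirectional():
    subject = HuggingfaceSubject(model_id='bert-base-uncased', bidirectional=True,
                                 region_layer_mapping={})
    subject.start_behavioral_task(task=ArtificialSubject.Task.next_word)
    prediction = subject.digest_text('the quick brown fox jumps over the')['behavior']
    assert len(str(prediction.values.item())) > 0  # some next-word string was predicted
```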

@shehadak force-pushed the ks/bidirectional-huggingface branch from bc44421 to 66311de on November 9, 2023, 20:00