A regular expression to match all emoji-only symbols

slevithan/emoji-regex-xs

emoji-regex-xs

This is a drop-in replacement for the emoji-regex package. It shares emoji-regex's API and passes all of its emoji tests, but reduces the uncompressed size by more than 98% (from ~13 kB to ~0.2 kB). It manages this by relying on the Unicode version built into the JavaScript environment, rather than hard-coding all emoji code points from a specific Unicode version.

Install and use

Via npm:

npm install emoji-regex-xs

In Node.js:
(This is copied from emoji-regex to show that it works the same.)

const emojiRegex = require('emoji-regex-xs');
// Or: import emojiRegex from 'emoji-regex-xs';

// Note: because the regular expression has the global flag set, this module
// exports a function that returns the regex rather than exporting the regular
// expression itself, to make it impossible to (accidentally) mutate the
// original regular expression.

const text = `
\u{231A}: ⌚ default emoji presentation character (Emoji_Presentation)
\u{2194}\u{FE0F}: ↔️ default text presentation character rendered as emoji
\u{1F469}: 👩 emoji modifier base (Emoji_Modifier_Base)
\u{1F469}\u{1F3FF}: 👩🏿 emoji modifier base followed by a modifier
`;

const regex = emojiRegex();
for (const match of text.matchAll(regex)) {
  const emoji = match[0];
  console.log(`Matched sequence ${emoji} — code points: ${[...emoji].length}`);
}

Console output:

Matched sequence ⌚ — code points: 1
Matched sequence ⌚ — code points: 1
Matched sequence ↔️ — code points: 2
Matched sequence ↔️ — code points: 2
Matched sequence 👩 — code points: 1
Matched sequence 👩 — code points: 1
Matched sequence 👩🏿 — code points: 2
Matched sequence 👩🏿 — code points: 2
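
The comment in the example above explains that the module exports a function because the regex has the global flag set. The following sketch shows the kind of accidental mutation this avoids. It uses a stand-in pattern (`\p{Emoji_Presentation}`) and a hypothetical `makeRegex` factory for illustration only; neither is the actual pattern or code that emoji-regex-xs ships.

```javascript
// Stand-in pattern for illustration only; NOT the regex exported by emoji-regex-xs.
const shared = /\p{Emoji_Presentation}/gu;
const watch = '\u231A'; // the watch emoji from the example text

// A global regex is stateful: a successful test() advances lastIndex.
const first = shared.test(watch);  // finds the match at index 0
const second = shared.test(watch); // resumes after the first match and finds nothing
console.log(first, second);

// Hypothetical factory mirroring the "export a function" design:
// every call returns a fresh RegExp object whose lastIndex starts at 0.
function makeRegex() {
  return /\p{Emoji_Presentation}/gu;
}
const third = makeRegex().test(watch);
const fourth = makeRegex().test(watch);
console.log(third, fourth);
```

Because each call to the factory starts from a clean state, callers cannot interfere with each other through a shared `lastIndex`.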

Comparison with emoji-regex and \p{RGI_Emoji}

| | emoji-regex | emoji-regex-xs | \p{RGI_Emoji} |
| --- | --- | --- | --- |
| Compatibility | Node.js 4, 2015-era browsers | Node.js 10, 2016-era browsers | Node.js 20, 2023-era browsers |
| Uncompressed size | ~13 kB | ~0.2 kB | N/A |
| Gzipped size | ~3 kB | ~0.2 kB | N/A |
| Unicode version | Uses the latest Unicode version at the time of release, so results are deterministic. | Uses the Unicode version that your environment supports natively, so results match the handling of other functionality. | Uses the Unicode version that your environment supports natively, so results match the handling of other functionality. |
| Matches everything matched by ES2024's \p{RGI_Emoji} | Yes. | Yes. | Yes. |
| Matches all non-RGI, underqualified emoji included in Unicode's emoji-test.txt | Yes. | Yes. | No/none. |
| Matches additional non-RGI emoji | Yes. Allows some overqualified emoji using an explicitly-defined list. | Yes. Uses a general pattern that matches all Unicode sequences that follow the structure of emoji.[1] | No. |

Footnotes

  1. This allows emoji-regex-xs to match emoji that are supported on only some platforms (e.g., "women wrestling: light skin" and the Texas flag) and that aren't correctly matched by emoji-regex.

More details about emoji, Unicode properties, and regexes

Emoji are complicated. Or more specifically, how they're defined in the Unicode Standard is complicated. So writing a regex that matches all emoji, and only emoji, is also complicated. For starters, individual emoji can be made up of between one and many Unicode code points, and there are a variety of different sequence patterns. There are also a variety of Unicode symbols, dingbats, etc. that aren't emoji and that we don't want to match. And to make things worse, there are underqualified and overqualified emoji that are commonly displayed as emoji and generated by certain emoji keyboards. The Unicode Standard even includes a list of over a thousand underqualified emoji sequences that it recommends displaying and processing as emoji.

Given the complexity, many libraries that roll their own emoji regex get it very wrong, e.g. by matching emoji fragments that split off some of their attributes, or by matching things like digits (0, 1, 2, …), #, *, or certain invisible characters. These characters are obviously not emoji, but they're matched by naive patterns because they might become emoji when followed by various combining characters. Or they might be special characters used in emoji sequences while not being emoji on their own.

ES2018 added support for matching Unicode properties in regular expressions with \p{…}, so you might think this problem is now trivial, given that the list of supported properties includes Emoji, Emoji_Presentation (EPres), Emoji_Modifier (EMod), Emoji_Modifier_Base (EBase), Emoji_Component (EComp), and Extended_Pictographic (ExtPict). But no. On their own, none of these are what you want. They match only one code point at a time, and their matches variously include emoji fragments and non-emoji characters.
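
As a rough illustration of why the single-code-point properties fall short, the sketch below runs `\p{Emoji}` and `\p{Emoji_Presentation}` against a few strings. The sample strings and variable names are chosen here for illustration and are not from the project's test suite.

```javascript
// \p{Emoji} includes non-emoji characters such as ASCII digits.
const digitHits = '2 + 2 = 4'.match(/\p{Emoji}/gu);
console.log(digitHits); // the three digits are reported as "emoji"

// It also matches one code point at a time, so a modifier sequence
// (woman + dark skin tone) is split into fragments.
const woman = '\u{1F469}\u{1F3FF}';
const fragments = woman.match(/\p{Emoji}/gu);
console.log(fragments.length); // 2 fragments instead of 1 emoji

// \p{Emoji_Presentation} is stricter, but it misses emoji that rely on
// the variation selector U+FE0F, such as the left-right arrow emoji.
const arrowEmoji = '\u2194\uFE0F';
const arrowMatched = /\p{Emoji_Presentation}/u.test(arrowEmoji);
console.log(arrowMatched); // false
```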

ES2024 added support for matching multicharacter Unicode properties of strings with \p{…}, so you might think one of the new properties Basic_Emoji, Emoji_Keycap_Sequence, RGI_Emoji_Modifier_Sequence, RGI_Emoji_Flag_Sequence, RGI_Emoji_Tag_Sequence, RGI_Emoji_ZWJ_Sequence, or RGI_Emoji will do the trick. Well, kind of. RGI_Emoji indeed seems like what we want, but unfortunately, some broadly-supported emoji are not in the official "RGI" (Recommended for General Interchange) list. It also doesn't match underqualified and overqualified emoji, which respectively omit or add certain invisible Unicode markers. For example, the iOS emoji keyboard overqualifies certain emoji. So we need something that matches everything in RGI_Emoji, and more. Additionally, \p{RGI_Emoji} relies on flag v, which is only supported by 2023-era browsers and Node.js 20+.
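
A minimal sketch of both limitations follows. The helper name `makeRgiEmojiRegex` is hypothetical. It builds the regex from a string inside `try`/`catch` because a regex literal with flag `v` would be a syntax error in environments that don't support it.

```javascript
// Hypothetical helper: returns an anchored \p{RGI_Emoji} regex,
// or null when the environment doesn't support flag v.
function makeRgiEmojiRegex() {
  try {
    return new RegExp('^\\p{RGI_Emoji}$', 'v');
  } catch (err) {
    return null;
  }
}

const rgi = makeRgiEmojiRegex();
if (rgi === null) {
  console.log('Flag v is not supported in this environment');
} else {
  // Fully-qualified sequences are matched as a whole.
  console.log(rgi.test('\u{1F469}\u{1F3FF}')); // woman + dark skin tone
  console.log(rgi.test('\u2194\uFE0F'));       // left-right arrow + U+FE0F
  // The underqualified form (missing U+FE0F) is not RGI, so it is rejected.
  console.log(rgi.test('\u2194'));
}
```

In an environment with flag `v`, the first two checks should report a match and the last should not, which is the gap that a broader pattern needs to cover.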

All of this is why the popular emoji-regex package exists. It does a great job of accurately matching most common-sense emoji. But to do so, it uses a massive (~13 kB uncompressed) regex that hard-codes a list of Unicode code points tied to a specific Unicode version. In contrast, emoji-regex-xs uses a general pattern that remains highly accurate in matching emoji, but needs only ~0.2 kB to do so. It follows emoji-regex's API and reuses its tests, so it can be swapped in as a replacement.

Note: The Unicode standard includes an official regex for matching emoji. However, although it can serve as a good foundation (after adapting to the JavaScript regex flavor), it matches some non-emoji characters like digits 0-9 and it matches fragments of some underqualified emoji (including some of those in the Unicode standard's emoji-test.txt list).
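
To make the digit problem concrete, here is a simplified JavaScript pattern loosely modeled on the structure of that official regex. It is an assumed adaptation written for this illustration (the names `suffix`, `element`, and `possibleEmoji` are made up here), so consult the Unicode emoji specification for the authoritative expression.

```javascript
// Simplified, assumed adaptation for illustration; not the official expression verbatim.
// Optional suffix: a skin tone modifier, U+FE0F (optionally followed by the
// keycap mark U+20E3), or a tag sequence ending in the cancel tag U+E007F.
const suffix = String.raw`(?:\p{Emoji_Modifier}|\uFE0F\u20E3?|[\u{E0020}-\u{E007E}]+\u{E007F})?`;

// One element: a pair of regional indicators, or any Emoji code point plus suffix.
const element = String.raw`(?:\p{Regional_Indicator}\p{Regional_Indicator}|\p{Emoji}${suffix})`;

// Elements may be chained with the zero width joiner U+200D.
const possibleEmoji = new RegExp(String.raw`${element}(?:\u200D${element})*`, 'gu');

// A modifier sequence is matched as a single unit...
const whole = '\u{1F469}\u{1F3FF}'.match(possibleEmoji);
console.log(whole.length); // 1

// ...but plain digits also match, because they have the Emoji property.
const digits = 'call 911'.match(possibleEmoji);
console.log(digits); // the three digits
```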
