refactor: systematize and expose the parsing options and tokenizer configuration #3423

gwhitney · 2025-03-18T21:19:16Z

I thought it was wisest for the sake of evaluating #3420 to provide a taste of the refactor. So I am opening the corresponding pull request "early" and will mark it as a work in progress. So far, this is primarily just a refactor, to systematize and clarify the tokenizing code. But it does also expose the list of token scanners as a configuration property on the parse function, so it already enables additional functionality -- for example, adding a token type like #FF0080 for color constants (our use case at the moment, and I've put a whole demo of this facility in parse.test.js). The remaining idea is to do the same for the parse table -- in particular, the list of delimiters would come from the big table in operators.js, rather than being redundantly repeated, and the table would become decorated with the parsing functions for the corresponding level of precedence. That would easily allow extension by addition of new operators, at either existing or new levels of precedence (either by adding to one of the entries in the operators table, or by splicing in a new entry between existing ones). If you added a // operator, to take an example I am working on, you would just need to add to the operators table, and not have to remember to add it to some list of delimiters as well, etc.

So that's the idea. I will take a pause so you can have a chance to look at the PR so far and provide feedback and/or your encouragement or discouragement of continuing in this vein. I think you will find, for example, that the tokenization code is now much more transparent, since it has been factored into several much smaller scanner functions which are just run in sequence, each looking for one particular type of token.

(I guess what I would really like to do is rewrite the parser in some popular parser grammar package, but that would be beyond the level of attention I can devote, for now anyway.)

Resolves #3422.

gwhitney · 2025-03-19T16:41:55Z

OK, rebased after #3425, which was a localized fix for #3422.

gwhitney · 2025-03-19T19:54:12Z

Amusingly I just noticed that GitHub automatically put a color dot after the example notation I used, supporting the idea that it's good to enable such a syntax extension. I do think that there might be some value in our project publishing our Chroma and number-theory extensions to mathjs as add-on packages, if you have any thoughts/ideas about where/how to most effectively publish such things. (And in fact, if we were to set up such a collection of add-ons, we might well want to break out some of the existing groups, like statistics or sets, etc, as separate add-ons, to reduce the "monolithicness" of mathjs.)

josdejong · 2025-05-07T10:25:25Z

Nice Job @gwhitney! I like your parse.tokens = { ... } refactor, that indeed helps a lot making the code better readable and making it extensible, and moving ParserState into a separate file makes sense. Do you have ideas on how to handle ordering and precedence for custom tokens and token handlers?

gwhitney · 2025-05-07T15:43:39Z

Great, glad to hear. As to handling/controlling precedence, I would plan to do it exactly the same way it is currently done: there would be a parse.nodes parameter (say, maybe there's a better name, like parse.expressions?) for parsing analogous to parse.tokens, and the order of entries in that parameter would determine the precedence.

Would you like me to polish just this PR up for merging more or less as is and continue with expressions in a later PR, or go ahead with the similar refactor for expressions within this single PR?
Would you like me to keep this PR non-breaking, or go ahead and incorporate e.g. a fix to Surprising parse results from binary and octal constants #3421 and/or other approved breaking changes, and have this go into v15?

gwhitney · 2025-05-07T16:03:00Z

Ah, work on this paused at the moment per #3420 (comment)

Including a test that implements a custom token type `#HEXHEX` that could be used for color constants.

gwhitney · 2025-07-06T19:27:26Z

OK, have rebased on current develop. My plan is to adapt the existing configuration settings per the decisions above, and then call it a day on this pr: it will have exposed the tokenizer configuration and systematized the config settings. That seems like enough for one PR; a further similar refactor for the parsing configuration can be done in a separate PR, and should wait for #3497 to land anyway because otherwise that PR would have to basically be re-done from scratch.

gwhitney · 2025-07-06T21:35:23Z

Hmm, I have run across a pragmatic point with the plans here. Right now if you get the "library instance"

import * as math from './lib/esm/index.js'

then the config object is read-only (e.g. math.config({number: 'Fraction'}) throws an error). On the other hand, changing the parser configuration works fine:

math.parse.isAlpha = (c, n, p) => math.parse.isAlpha(c,n,p) || c === '@'
math.evaluate('@home = 7; sqrt(@home)')

returns a ResultSet with the square root of seven.

So my concern is that moving parser configuration items like parse.isAlpha to config.parse.isAlpha will suddenly make them only configurable in a create()-ed instance, which would to my mind be an undesirable breaking change (loss of functionality). I see three options:

Make config always modifiable, even in the "library instance". I don't know if this is feasible; why is config currently read-only in the library instance?
Go ahead with the parameter movement and just do it as a breaking change in v15, making the library instance somewhat less flexible for no reason other than to have all configuration in the same place.
Keep config read-only, but make config.parse refer to a particular object which may be mutated, even though you cannot change which particular object config.parse refers to. In this option, math.config({parse: {}}) would still be an error, but math.config().parse.isAlpha = c => c >= 'A' && c <= 'Z' would be allowed and would work to recognize only capital roman letters as alphabetic. Note that currently config() with no arguments returns a deep copy of the config object, so for this option to succeed, it would have to change config() to produce a shallow copy. So this plan might run into the same difficulty as option (1), depending on what the obstruction there is.

There may be other options as well. In any case, I'd best hold off once again until @josdejong illuminates the difficulties with mutating config in the library instance and/or picks an option for how to handle this. Thanks for getting back on this.

josdejong · 2025-07-09T13:33:50Z

I think ideally it should not be possible to alter the global/default mathjs instance like math.parse.isAlpha = ..., since that may conflict with other parts of a code base also using the same mathjs instance but not aware of these config changes.

It will indeed be a breaking change, so we'll need to document this and ideally throw an explanatory error when you try to use math.parse.isAlpha = ... in v15.

So that is your option (2). Agree?

gwhitney · 2025-07-09T19:02:54Z

OK, got it; I will modify this PR to be on top of v15, and to put the parsing configuration into config and make it not modifiable in the global instance. And since this is now already breaking, and plenty big enough especially with the moves into config, I will not try to subsume any other parser refactoring into this PR.

gwhitney marked this pull request as draft March 18, 2025 21:23

This was referenced Mar 19, 2025

variable named E blocks .-operators #3422

Closed

Surprising parse results from binary and octal constants #3421

Open

fix: #3422 parse dot operators after an implicit multiplication with symbol E #3425

Merged

gwhitney force-pushed the refactor/parse_config branch from 7ee5387 to 6dbd02f Compare March 19, 2025 16:41

gwhitney mentioned this pull request Mar 19, 2025

Tests for examples in embedded docs #3413

Merged

gwhitney mentioned this pull request Mar 19, 2025

Evaluating 'help(not)' throws SyntaxError #1905

Open

gwhitney mentioned this pull request Apr 23, 2025

fix: #3421 require a space or delimiter after hex, bin, and oct values #3463

Merged

gwhitney mentioned this pull request May 7, 2025

Proposed parsing refactor #3420

Open

josdejong mentioned this pull request Jun 4, 2025

Refactor: Modularize parse.js and add unit tests #3486

Closed

gwhitney added 7 commits July 6, 2025 12:14

refactor: make ParseState its own class in its own file

53bec06

refactor: remove NAMED_DELIMITERS to prep for using properties

384386b

refactor: fold reading comments and skipping characters into ParseState

350e568

doc/test: Document options and add tests

a47d9a5

Including a test that implements a custom token type `#HEXHEX` that could be used for color constants.

doc: comment the token scanner object where it appears

c20b1da

test: add test for josdejong#3422

0201b96

test: consolidate after rebasing on josdejong#3425

90af323

gwhitney force-pushed the refactor/parse_config branch from 0424ee0 to 90af323 Compare July 6, 2025 19:23

gwhitney mentioned this pull request Jul 9, 2025

v15 breaking changes #3453

Open

8 tasks

gwhitney changed the title ~~refactor: systematize and expose the parsing configuration~~ refactor: systematize and expose the parsing options and tokenizer configuration Jul 9, 2025

josdejong added this to the v15 milestone Jul 16, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

refactor: systematize and expose the parsing options and tokenizer configuration #3423

refactor: systematize and expose the parsing options and tokenizer configuration #3423

Uh oh!

gwhitney commented Mar 18, 2025 •

edited

Loading

Uh oh!

gwhitney commented Mar 19, 2025

Uh oh!

gwhitney commented Mar 19, 2025

Uh oh!

josdejong commented May 7, 2025

Uh oh!

gwhitney commented May 7, 2025

Uh oh!

gwhitney commented May 7, 2025

Uh oh!

gwhitney commented Jul 6, 2025

Uh oh!

gwhitney commented Jul 6, 2025 •

edited

Loading

Uh oh!

josdejong commented Jul 9, 2025

Uh oh!

gwhitney commented Jul 9, 2025 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

refactor: systematize and expose the parsing options and tokenizer configuration #3423

Are you sure you want to change the base?

refactor: systematize and expose the parsing options and tokenizer configuration #3423

Uh oh!

Conversation

gwhitney commented Mar 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gwhitney commented Mar 19, 2025

Uh oh!

gwhitney commented Mar 19, 2025

Uh oh!

josdejong commented May 7, 2025

Uh oh!

gwhitney commented May 7, 2025

Uh oh!

gwhitney commented May 7, 2025

Uh oh!

gwhitney commented Jul 6, 2025

Uh oh!

gwhitney commented Jul 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

josdejong commented Jul 9, 2025

Uh oh!

gwhitney commented Jul 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

gwhitney commented Mar 18, 2025 •

edited

Loading

gwhitney commented Jul 6, 2025 •

edited

Loading

gwhitney commented Jul 9, 2025 •

edited

Loading