-
Notifications
You must be signed in to change notification settings - Fork 49
OLS-2025: Collect config on startup #417
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
WalkthroughAdds config collection: new UserDataCollection fields Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor Starter as Service Starter
participant App as lightspeed_stack.main()
participant Model as UserDataCollection
participant Store as store_config()
participant FS as File System
Starter->>App: start --config-file <path>
App->>Model: load configuration (user_data_collection)
alt config_enabled == true
App->>Store: store_config(config_file)
Store->>FS: mkdir -p config_storage (if missing)
Store->>FS: write JSON {metadata, configuration: raw YAML}
Store-->>App: return stored file path
else config_enabled == false
App-->>Starter: log "config collection disabled"
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
Tip 🔌 Remote MCP (Model Context Protocol) integration is now available!Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats. ✨ Finishing Touches🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (11)
examples/lightspeed-stack-lls-external.yaml (1)
18-19: Config collection enabled in example — confirm intent and add a security warning comment.Enabling config collection by default in examples may persist raw YAML (including secrets like API keys) to disk. If intentional for demonstration, add an explicit warning to guide users on securing storage and permissions.
transcripts_storage: "/tmp/data/transcripts" - config_enabled: true + # WARNING: Enabling config collection stores the original YAML (may include secrets). + # Ensure config_storage is on a secure volume and files are written with restrictive permissions (e.g., 0600). + config_enabled: true config_storage: "/tmp/data/config"examples/lightspeed-stack-lls-library.yaml (1)
17-18: Same as external example: enabled by default — add a security warning comment.Mirror the caution in this example too so users understand the implications.
transcripts_storage: "/tmp/data/transcripts" - config_enabled: true + # WARNING: Enabling config collection stores the original YAML (may include secrets). + # Ensure config_storage is on a secure volume and files are written with restrictive permissions (e.g., 0600). + config_enabled: true config_storage: "/tmp/data/config"src/models/config.py (1)
220-223: Tighten validation: reject empty strings (and optionally non-absolute paths).Current check allows empty strings. This can lead to runtime errors or writing to unintended locations. At minimum, guard against blanks.
- if self.config_enabled and self.config_storage is None: + if self.config_enabled and ( + self.config_storage is None or self.config_storage.strip() == "" + ): raise ValueError( "config_storage is required when config collection is enabled" )Optional follow-ups (no blocking):
- Consider typing config_storage as Optional[Path] and validating directory semantics (absolute path, writable) similar to other path validations in this module.
- Docstring nit: rename “storage_location” to “storage fields” for consistency with field names.
src/app/endpoints/config.py (1)
43-44: Docs updated to include config fields — also consider redacting/guarding secrets in /config.The example now reflects config_enabled and config_storage. Independently of this change, please verify that /config is not publicly exposed when auth is disabled and that sensitive values (e.g., api_key) are redacted in responses where appropriate.
tests/unit/models/test_config.py (1)
601-603: LGTM on serialization expectations; add tests for the new validation paths.The dump assertion for config_enabled/config_storage looks good. Please add unit tests to cover:
- config_enabled=True with missing/blank config_storage should raise.
- config_enabled=True with a non-empty storage should pass.
Example tests to add:
def test_user_data_collection_config_enabled_requires_storage() -> None: with pytest.raises(ValueError, match="config_storage is required when config collection is enabled"): UserDataCollection(config_enabled=True, config_storage=None) with pytest.raises(ValueError, match="config_storage is required when config collection is enabled"): UserDataCollection(config_enabled=True, config_storage=" ") # blank def test_user_data_collection_config_enabled_with_storage_ok() -> None: cfg = UserDataCollection(config_enabled=True, config_storage="/tmp/data/config") assert cfg.config_enabled is True assert cfg.config_storage == "/tmp/data/config"If helpful, I can open a follow-up PR adding these tests.
src/lightspeed_stack.py (4)
43-50: Consider redacting secrets before persisting raw YAMLStoring the original YAML verbatim can leak credentials or tokens. If configs may contain secrets, add an opt-in redaction step or at least a warning in docs. A simple first step is masking common keys (token, api_key, password, secret, client_secret, private_key).
I can propose a small YAML-aware redaction helper if desired.
52-59: Ensure storage directory has restrictive permissionsCreate the directory (or tighten if it exists) with 0700 to avoid other users reading stored configs.
config_storage = configuration.user_data_collection_configuration.config_storage if config_storage is None: raise ValueError("config_storage must be set when config collection is enabled") storage_path = Path(config_storage) - storage_path.mkdir(parents=True, exist_ok=True) + storage_path.mkdir(parents=True, exist_ok=True) + try: + os.chmod(storage_path, 0o700) + except Exception: + logger.debug("Could not set 0700 permissions on config storage dir: %s", storage_path) config_file_path = storage_path / f"{suid.get_suid()}.json"
108-113: Don’t fail service startup if config persistence failsMake the collection best-effort. If I/O or permissions fail, log a warning and continue.
- # store service configuration if enabled - if configuration.user_data_collection_configuration.config_enabled: - store_config(args.config_file) - else: - logger.debug("Config collection is disabled in configuration") + # store service configuration if enabled (best-effort) + if configuration.user_data_collection_configuration.config_enabled: + try: + store_config(args.config_file) + except Exception as e: + logger.warning("Config collection failed: %s", e) + else: + logger.debug("Config collection is disabled in configuration")
30-50: Return the stored file path (or ID) to enable downstream correlationReturning the path or a content hash/SUID would let other components reference the stored config (e.g., attach ID to transcripts/feedback) without duplicating data.
Apply:
-def store_config(cfg_file: str) -> None: +def store_config(cfg_file: str) -> Path: @@ - logger.info("Service configuration stored in '%s'", config_file_path) + logger.info("Service configuration stored in '%s'", config_file_path) + return config_file_pathAnd in main, ignore the return value or propagate the ID as needed.
tests/unit/test_lightspeed_stack.py (2)
205-238: Add a negative test for missing storage path (config_storage=None)Covers the error path explicitly and protects against accidental regressions if validation changes.
You can append:
@patch("lightspeed_stack.configuration") def test_store_config_raises_when_storage_missing(mock_configuration_module, sample_config_file): mock_config = MagicMock() mock_config.user_data_collection_configuration.config_storage = None mock_configuration_module.user_data_collection_configuration = ( mock_config.user_data_collection_configuration ) with pytest.raises(ValueError): store_config(sample_config_file)
205-238: Optionally validate timestamp format includes UTC offsetIf you want to assert the UTC semantic, check it contains a timezone offset like +00:00.
Example:
assert stored_data["metadata"]["timestamp"].endswith("+00:00")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (7)
examples/lightspeed-stack-lls-external.yaml(1 hunks)examples/lightspeed-stack-lls-library.yaml(1 hunks)src/app/endpoints/config.py(1 hunks)src/lightspeed_stack.py(3 hunks)src/models/config.py(2 hunks)tests/unit/models/test_config.py(1 hunks)tests/unit/test_lightspeed_stack.py(1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
src/lightspeed_stack.py (2)
src/configuration.py (2)
configuration(55-60)user_data_collection_configuration(79-84)src/utils/suid.py (1)
get_suid(6-12)
tests/unit/test_lightspeed_stack.py (2)
src/lightspeed_stack.py (2)
create_argument_parser(66-93)store_config(30-63)src/configuration.py (1)
user_data_collection_configuration(79-84)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build-pr
- GitHub Check: e2e_tests
🔇 Additional comments (6)
src/models/config.py (1)
208-209: LGTM: New fields align with examples and endpoint docs.The addition of config_enabled and config_storage is consistent across the codebase.
src/lightspeed_stack.py (1)
30-39: Good addition: clear scope and rationale for storing config onceDocstring explains immutability and dedup rationale well.
tests/unit/test_lightspeed_stack.py (4)
3-10: Solid, focused test scaffoldingImports, fixtures, and patching strategy correctly isolate filesystem and module state. Good use of tmpdir and MagicMock.
67-96: End-to-end happy path coverage looks goodThis validates file creation and essential JSON structure, including path echo and content presence.
100-129: Nice check for directory auto-creationCovers a common deployment case where the storage path doesn’t pre-exist.
157-177: Good version propagation assertionPatching version ensures the metadata wiring is exercised.
tisnik
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what it is good for? Do we need to support this feature? I mean - the timestamp etc., the service can be (re)started any time. And why the config file is copied (ok, transformed), if it does not have to contain all the config options?
Yes, we need to know how users are configuring the service. It contains also service version, which is another important piece of information. It is useful on its own, but also to handle breaking changes in config (if that is the case for postprocessing).
It is there to be able to correlate to the rest of the data, e.g. what config was used to generate concrete transcript.
And the config will be collected with each restart. It might produce duplicate data, but that is expected. Would you prefer to check the stored data first, to avoid storing the same config after the restart? I think this belongs more on the data processing side, but I'm fine doing it here.
This is how the data collection works. It stores what needs to be collected in defined place. The data source here is the file, so yes, it is basically copied, but how else would you do that? Note that the collection dir is purged after collection so we can't do symlink (it is forbidden in exporter too).
What do you mean? That the config file doesn't contain all the configuration that is later added in the runtime config object? We are interested in the user/input configuration - not what it is transformed to internally. Is llamastack config something users can change? Or is it shipped with the lightspeed-stack? If users can change it, then we also need to collect that - I would add some other metadata to distinguish these two configs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (7)
README.md (7)
128-130: Add an inline caution in the example about secrets in stored configs.When enabled, the service persists the raw YAML config. Add a brief YAML comment to warn users not to include secrets or to use sanitized configs.
Apply this diff:
transcripts_enabled: true transcripts_storage: "/tmp/data/transcripts" + # Caution: when enabled, the raw YAML config is stored with metadata. + # Avoid secrets in this file or use a sanitized copy. config_enabled: true config_storage: "/tmp/data/config"
246-248: Repeat the secrets caution in the library-client example block.Same rationale as above—store-only sanitized configs or ensure no secrets are present.
Apply this diff:
transcripts_enabled: true transcripts_storage: "/tmp/data/transcripts" + # Caution: when enabled, the raw YAML config is stored with metadata. + # Avoid secrets in this file or use a sanitized copy. config_enabled: true config_storage: "/tmp/data/config"
262-264: Add inline caution to the generic user_data_collection example.Make the risk explicit at each example site for clarity.
Apply this diff:
transcripts_enabled: true transcripts_storage: "/tmp/data/transcripts" + # Caution: when enabled, the raw YAML config is stored with metadata. + # Avoid secrets in this file or use a sanitized copy. config_enabled: true config_storage: "/tmp/data/config"
272-274: Fix markdownlint and minor grammar in the list.The linter expects asterisks (MD004). Also ensure “JSON” is fully spelled out.
Apply this diff:
-* `config_enabled`: Enable/disable collection of service configuration at startup -* `config_storage`: Directory path where configuration JSON files are stored +* `config_enabled`: Enable/disable collection of service configuration at startup +* `config_storage`: Directory path where configuration JSON files are storedIf your repo enforces MD004 globally, consider updating earlier list items as well for consistency (can be a follow-up).
661-663: Add inline caution in Quick Integration snippet.Mirrors the above examples; helps avoid accidental storage of sensitive config.
Apply this diff:
transcripts_enabled: true transcripts_storage: "/shared/data/transcripts" + # Caution: when enabled, the raw YAML config is stored with metadata. + # Avoid secrets in this file or use a sanitized copy. config_enabled: true config_storage: "/shared/data/config"
128-130: Consider documenting the /config endpoint and stored payload format.We mention the fields, but a short subsection under “Endpoints” describing the collected JSON schema (metadata + configuration) would help users and integrators.
I can draft a minimal “Configuration Collection” endpoint/doc blurb with example response and caveats. Want me to add it?
262-264: Review storage safeguards for config persistence
Sanitization
Thestore_configfunction reads and dumps the raw YAML under"configuration"(src/lightspeed_stack.py:40–43). There is no redaction or sanitization of secrets. If storing raw configs (including credentials) is acceptable, please document this decision clearly in the README. Otherwise, introduce a sanitization step or redact sensitive fields before persisting.Symlink protection
We currently passconfig_storagestraight toPath.mkdirand file writes (src/lightspeed_stack.py:56–61) without checking for symlinks. To prevent link traversal or accidental writes outside the intended directory, consider:
• Verifying thatconfig_storageand all parent directories are not symlinks (e.g., viaos.lstat().is_symlink())
• Resolving the path withPath.resolve(strict=True)and confirming it remains under an approved base directoryDirectory creation
Usingstorage_path.mkdir(parents=True, exist_ok=True)correctly handles nested paths and race conditions.Filename uniqueness
Filenames are generated withuuid4()(src/lightspeed_stack.py:58) and tested bytest_store_config_unique_filenames. A timestamp is included in the metadata (not in the filename). If you need chronological ordering of the files, consider adding a timestamp prefix to the filename.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (5)
README.md(5 hunks)docs/deployment_guide.md(2 hunks)docs/getting_started.md(1 hunks)lightspeed-stack.yaml(1 hunks)tests/configuration/lightspeed-stack.yaml(1 hunks)
✅ Files skipped from review due to trivial changes (2)
- docs/getting_started.md
- docs/deployment_guide.md
🧰 Additional context used
🪛 LanguageTool
README.md
[grammar] ~272-~272: There might be a mistake here.
Context: ...tion of service configuration at startup - config_storage: Directory path where configuration JSO...
(QB_NEW_EN)
🪛 markdownlint-cli2 (0.17.2)
README.md
272-272: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
273-273: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
🔇 Additional comments (1)
tests/configuration/lightspeed-stack.yaml (1)
35-38: Configuration Tests Updated Successfully
The newtranscripts_enabled,transcripts_storage,config_enabled, andconfig_storageentries intests/configuration/lightspeed-stack.yamlcorrectly mirror the model changes and defaults. At runtime, both the transcript and config storage paths are created viastorage_path.mkdir(parents=True, exist_ok=True)in the application code, so no manual fixture or teardown is needed in the tests. CI jobs run in isolated
/tmpenvironments, preventing residue or permission issues between runs.
| config_enabled: true | ||
| config_storage: "/tmp/data/config" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Do not enable config collection by default in shipped config (privacy/security risk).
Having config_enabled: true in the default config may persist raw configuration on first run, potentially capturing secrets, API keys, or internal file paths without explicit user consent. Recommend defaulting to false here and letting docs/examples instruct how to enable.
Apply this diff:
- config_enabled: true
+ config_enabled: false
config_storage: "/tmp/data/config"🤖 Prompt for AI Agents
In lightspeed-stack.yaml around lines 23 to 24, the default config_enabled is
set to true; change it to false to avoid enabling config collection by default,
leaving config_storage as-is; also update any docs/examples to show how to
explicitly enable config collection (and mention user consent and any migration
steps for existing installs) so users opt in rather than having it enabled by
default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, this is actually a good comment. In OLS, the config doesn't contain any secrets (if it does, it is user's fault and we are fine collecting it in this case), but the lightspeed-stack.yaml contains api_key. So I'll revisit the config and do some extra redaction.
cc @tisnik
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
| "service_version": version.__version__, | ||
| "config_file_path": cfg_file, | ||
| }, | ||
| "configuration": config_content, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another concern I have reading this, is that we do not redact anything from the configuration file. These files may include passwords [0][1], API keys [2], etc...
I'm not sure clients will want to send that anywhere. And even those storing this kind of content might have to deal with legal issues/procedures in order to do so.
[0]
lightspeed-stack/src/models/config.py
Line 19 in 45eb299
| tls_key_password: Optional[FilePath] = None |
[1]
lightspeed-stack/src/models/config.py
Line 64 in 45eb299
| password: str |
[2]
lightspeed-stack/src/models/config.py
Line 151 in 45eb299
| api_key: Optional[str] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup - see previous comment: #417 (comment)
I'll have a look at the redaction of these fields.
Description
Collect config on startup.
Similar to openshift/lightspeed-service#2582
Example data:
{ "metadata": { "timestamp": "2025-08-19T08:32:31.230917+00:00", "service_version": "0.1.3", "config_file_path": "/home/ometelka/projects/lightspeed-core/lightspeed-stack/dev/lightspeed-stack-conf.yaml" }, "configuration": "name: foo bar baz\nservice..." }Type of change
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit
New Features
Documentation
Tests