Kafka: Expand `default_msg_processor` into a miniature decoding unit #306
base: main
Conversation
Force-pushed bee01aa to 7e91ef5
```sh
ingestr ingest \
    --source-uri 'kafka://?bootstrap_servers=localhost:9092&group_id=test&value_type=json&select=value' \
```
Would the same work without `select`, but with the format set to `flexible`?
I think using select implicitly sets the format to flexible.
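To illustrate the point, here is a minimal sketch of how `select`/`include` could implicitly switch the output format to `flexible`. The `KafkaDecodingOptions` name is taken from the PR description, but the exact implementation shown here is hypothetical:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KafkaDecodingOptions:
    """Hypothetical sketch of the decoding options introduced by this PR."""

    format: str = "standard_v1"
    include: Optional[str] = None  # comma-separated field names
    select: Optional[str] = None   # single field to project into the output

    def __post_init__(self):
        # `include`/`select` only make sense with the `flexible` layout,
        # so using either of them selects that format automatically.
        if self.include is not None or self.select is not None:
            self.format = "flexible"


options = KafkaDecodingOptions(select="value")
print(options.format)  # flexible
```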
> When using the `include` or `select` option, the decoder will automatically select the `flexible` output format.
@karakanb: Does this statement help with the question you had about the flexible output format?
```python
if options.format == "standard_v1":
    UserWarning(
        "Future versions of ingestr will use the `standard_v2` output format. "
        "To retain compatibility, make sure to start using `format=standard_v1` early."
    )
return {
    "_kafka": standard_payload,
    "_kafka_msg_id": message_id,
}
```
This section might need further refinement.
a) Is the user warning emitted at all? Let's check that using a test case.
b) What if the user decides to deliberately continue using standard_v1? Can we have a mechanism to suppress the user warning in this case?
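A minimal sketch covering both points: (a) a bare `UserWarning(...)` only constructs the exception object, so `warnings.warn(...)` is needed for the warning to actually be emitted; (b) the standard `warnings` filter machinery gives users a suppression mechanism. The `format_payload` helper name is hypothetical:

```python
import warnings


def format_payload(fmt: str, standard_payload: dict, message_id: str) -> dict:
    """Hypothetical helper wrapping the snippet under review."""
    if fmt == "standard_v1":
        # (a) Actually emit the warning instead of just constructing it.
        warnings.warn(
            "Future versions of ingestr will use the `standard_v2` output format. "
            "To retain compatibility, make sure to start using `format=standard_v1` early.",
            UserWarning,
        )
    return {"_kafka": standard_payload, "_kafka_msg_id": message_id}


# (b) Users deliberately staying on `standard_v1` can silence the warning.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)
    record = format_payload("standard_v1", {"value": 42}, "0-0-1")
```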
- Accept a bunch of decoding options per `KafkaDecodingOptions`
- Provide a bunch of output formatting options per `KafkaEvent`
- Tie both elements together using `KafkaEventProcessor`

The machinery is effectively the same as before, but provides a few more options to allow type decoding for the Kafka event's key/value slots, a selection mechanism to limit the output to specific fields only, and a small projection mechanism to optionally drill down into a specific field. In combination, those decoding options allow users to relay JSON-encoded Kafka event values directly into a destination table, without any metadata wrappings. The output formatter provides three different variants out of the box. More variants can be added in the future, as other users or use cases may have different requirements in the same area. Most importantly, the decoding unit is very compact, so relevant tasks don't need a corresponding transformation unit down the pipeline, keeping the whole ensemble lean, in the very spirit of ingestr.
- `value_type`: The data type of the Kafka event `value` field. Possible values: `json`.
- `format`: The output format/layout. Possible values: `standard_v1`, `standard_v2`, `flexible`.
- `include`: Which fields to include in the output, comma-separated.
- `select`: Which field to select (pick) into the output.
Todo: Add a remark to the `select` explanation, a compressed version of:

> Always use `select=value` to select the value [sic!] of the Kafka event. This has been chosen deliberately to adhere to Kafka jargon, even though ingestr's internal event layout relays this field as `data`.
Pitch
Loading data from Kafka into a database destination works well, but there are no options to properly decode and break out the Kafka event value, in order to relay only that into the target database, without any metadata.
Observation
For example, Kafka Connect provides such configuration options for similar use cases; the options added here are fragments thereof.
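For comparison, the corresponding knobs in Kafka Connect are its converter settings (standard Connect worker/connector properties, shown here only to illustrate the analogy to `key_type`/`value_type`):

```properties
# Kafka Connect decodes record keys/values via pluggable converters,
# analogous to the `key_type`/`value_type` options proposed here.
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=false
```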
Solution
This patch slightly builds upon and expands the existing `default_msg_processor` implementation to accept a few more options which resolve our problem.

Details
- `KafkaDecodingOptions`
- `KafkaEvent`
- `KafkaEventProcessor`

The machinery is effectively the same as before, but provides a few more options to allow type decoding for the Kafka event's key/value slots (`key_type` and `value_type`), a selection mechanism to limit the output to specific fields only (`include`), a small projection mechanism to optionally drill down into a specific field (`select`), and an option to choose the output format (`format`).

In combination, those decoding options allow users to relay JSON-encoded Kafka event values directly into a destination table, without any metadata wrappings. Currently, the output formatter provides three different variants out of the box (`standard_v1`, `standard_v2`, `flexible`)[^1]. More variants can be added in the future, as other users or use cases may have different requirements in the same area.

Most importantly, the decoding unit is very compact, so relevant tasks do NOT require a corresponding transformation unit down the pipeline, keeping the whole ensemble lean, in the very spirit of ingestr.
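To make the three layouts concrete, here is a hypothetical sketch. The `format_event` helper and the sample payload are illustrative; the metadata column names follow the snippet reviewed above and issue #289:

```python
from typing import Optional


def format_event(fmt: str, payload: dict, message_id: str,
                 select: Optional[str] = None) -> dict:
    """Hypothetical sketch of the three output layouts."""
    if fmt == "standard_v1":
        # Current layout: payload wrapped with metadata columns.
        return {"_kafka": payload, "_kafka_msg_id": message_id}
    if fmt == "standard_v2":
        # Same layout, with the metadata column renamed per issue #289.
        return {"_kafka": payload, "_kafka__msg_id": message_id}
    if fmt == "flexible":
        # Relay the (optionally projected) payload directly, no metadata.
        return payload[select] if select else payload
    raise ValueError(f"Unknown output format: {fmt}")


event = {"value": {"sensor_id": 2, "reading": 42.42}}
print(format_event("flexible", event, "0-0-1", select="value"))
# {'sensor_id': 2, 'reading': 42.42}
```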
Preview

```sh
uv pip install --upgrade 'ingestr @ git+https://github.com/crate-workbench/ingestr.git@kafka-decoder'
```

Example
```sh
duckdb kafka.duckdb 'SELECT * FROM demo.kafka WHERE sensor_id>1;'
```

Backlog
Footnotes

[^1]: The `standard_v2` output format is intended to resolve "Naming things: Rename `_kafka_msg_id` to `_kafka__msg_id`" #289.