Commit 94bfa13

refactor: logs (#2181)
1 parent: 5937940 · commit: 94bfa13

101 files changed (+4516, -3283 lines)

.typos.toml

Lines changed: 2 additions & 1 deletion
@@ -15,5 +15,6 @@ flate = "flate"
 [files]
 extend-exclude = [
   "pnpm-lock.yaml",
-  "*/**/df-functions.md"
+  "*/**/df-functions.md",
+  "**/*.svg"
 ]

docs/faq-and-others/faq.md

Lines changed: 1 addition & 1 deletion
@@ -219,7 +219,7 @@ Learn more about indexing: [Index Management](/user-guide/manage-data/data-index
 
 **Real-Time Processing**:
 - **[Flow Engine](/user-guide/flow-computation/overview.md)**: Real-time stream processing system that enables continuous, incremental computation on streaming data with automatic result table updates
-- **[Pipeline](/user-guide/logs/pipeline-config.md)**: Data parsing and transformation mechanism for processing incoming data in real-time, with configurable processors for field extraction and data type conversion across multiple data formats
+- **[Pipeline](/reference/pipeline/pipeline-config.md)**: Data parsing and transformation mechanism for processing incoming data in real-time, with configurable processors for field extraction and data type conversion across multiple data formats
 - **Output Tables**: Persist processed results for analysis

docs/getting-started/quick-start.md

Lines changed: 1 addition & 1 deletion
@@ -237,7 +237,7 @@ ORDER BY
 +---------------------+-------+------------------+-----------+--------------------+
 ```
 
-The `@@` operator is used for [term searching](/user-guide/logs/query-logs.md).
+The `@@` operator is used for [term searching](/user-guide/logs/fulltext-search.md).
 
 ### Range query

docs/greptimecloud/integrations/fluent-bit.md

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@ Fluent Bit can be configured to send logs to GreptimeCloud using the HTTP protoc
     http_Passwd <password>
 ```
 
-In this example, the `http` output plugin is used to send logs to GreptimeCloud. For more information, and extra options, refer to the [Logs HTTP API](https://docs.greptime.com/user-guide/logs/write-logs#http-api) guide.
+In this example, the `http` output plugin is used to send logs to GreptimeCloud. For more information, and extra options, refer to the [Logs HTTP API](https://docs.greptime.com/reference/pipeline/write-log-api/#http-api) guide.
 
 ## Prometheus Remote Write

docs/greptimecloud/integrations/kafka.md

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ Here we are using Vector as the tool to transport data from Kafka to GreptimeDB.
 ## Logs
 
 A sample configuration. Note that you will need to [create your
-pipeline](https://docs.greptime.com/user-guide/logs/pipeline-config/) for log
+pipeline](https://docs.greptime.com/user-guide/logs/use-custom-pipelines/) for log
 parsing.
 
 ```toml
Lines changed: 176 additions & 0 deletions

@@ -0,0 +1,176 @@
---
keywords: [built-in pipelines, greptime_identity, JSON logs, log processing, time index, pipeline, GreptimeDB]
description: Learn about GreptimeDB's built-in pipelines, including the greptime_identity pipeline for processing JSON logs with automatic schema creation, type conversion, and time index configuration.
---

# Built-in Pipelines

GreptimeDB offers built-in pipelines for common log formats, allowing you to use them directly without creating new pipelines.

Note that the built-in pipelines are not editable, and the `greptime_` prefix in pipeline names is reserved.

## `greptime_identity`

The `greptime_identity` pipeline is designed for writing JSON logs and automatically creates a column for each field in the JSON log.

- The first-level keys in the JSON log are used as column names.
- An error is returned if the same field has different types across rows.
- Fields with `null` values are ignored.
- If no time index is specified, an additional column, `greptime_timestamp`, is added to the table as the time index to indicate when the log was written.

### Type conversion rules

- `string` -> `string`
- `number` -> `int64` or `float64`
- `boolean` -> `bool`
- `null` -> ignore
- `array` -> `json`
- `object` -> `json`

For example, given the following JSON data:

```json
[
  {"name": "Alice", "age": 20, "is_student": true, "score": 90.5, "object": {"a": 1, "b": 2}},
  {"age": 21, "is_student": false, "score": 85.5, "company": "A", "whatever": null},
  {"name": "Charlie", "age": 22, "is_student": true, "score": 95.5, "array": [1, 2, 3]}
]
```

We'll merge the schema for each row of this batch to get the final schema. The table schema will be:

```sql
mysql> desc pipeline_logs;
+--------------------+---------------------+------+------+---------+---------------+
| Column             | Type                | Key  | Null | Default | Semantic Type |
+--------------------+---------------------+------+------+---------+---------------+
| age                | Int64               |      | YES  |         | FIELD         |
| is_student         | Boolean             |      | YES  |         | FIELD         |
| name               | String              |      | YES  |         | FIELD         |
| object             | Json                |      | YES  |         | FIELD         |
| score              | Float64             |      | YES  |         | FIELD         |
| company            | String              |      | YES  |         | FIELD         |
| array              | Json                |      | YES  |         | FIELD         |
| greptime_timestamp | TimestampNanosecond | PRI  | NO   |         | TIMESTAMP     |
+--------------------+---------------------+------+------+---------+---------------+
8 rows in set (0.00 sec)
```

The data will be stored in the table as follows:

```sql
mysql> select * from pipeline_logs;
+------+------------+---------+---------------+-------+---------+---------+----------------------------+
| age  | is_student | name    | object        | score | company | array   | greptime_timestamp         |
+------+------------+---------+---------------+-------+---------+---------+----------------------------+
|   22 |          1 | Charlie | NULL          |  95.5 | NULL    | [1,2,3] | 2024-10-18 09:35:48.333020 |
|   21 |          0 | NULL    | NULL          |  85.5 | A       | NULL    | 2024-10-18 09:35:48.333020 |
|   20 |          1 | Alice   | {"a":1,"b":2} |  90.5 | NULL    | NULL    | 2024-10-18 09:35:48.333020 |
+------+------------+---------+---------------+-------+---------+---------+----------------------------+
3 rows in set (0.01 sec)
```

### Specify time index

A time index is necessary in GreptimeDB. Since the `greptime_identity` pipeline does not require a YAML configuration, you must set the time index in the query parameters if you want to use a timestamp from the log data instead of the automatically generated timestamp when the data arrives.

Example of incoming log data:

```JSON
[
  {"action": "login", "ts": 1742814853}
]
```

To instruct the server to use `ts` as the time index, set the following query parameter in the HTTP request:

```shell
curl -X "POST" "http://localhost:4000/v1/ingest?db=public&table=pipeline_logs&pipeline_name=greptime_identity&custom_time_index=ts;epoch;s" \
  -H "Content-Type: application/json" \
  -H "Authorization: Basic {{authentication}}" \
  -d $'[{"action": "login", "ts": 1742814853}]'
```

The `custom_time_index` parameter accepts two formats, depending on the input data format:

- Epoch number format: `<field_name>;epoch;<resolution>`
  - The field can be an integer or a string.
  - The resolution must be one of: `s`, `ms`, `us`, or `ns`.
- Date string format: `<field_name>;datestr;<format>`
  - For example, if the input data contains a timestamp like `2025-03-24 19:31:37+08:00`, the corresponding format should be `%Y-%m-%d %H:%M:%S%:z`.

With the configuration above, the resulting table will correctly use the specified log data field as the time index.

```sql
DESC pipeline_logs;
```

```sql
+--------+-----------------+------+------+---------+---------------+
| Column | Type            | Key  | Null | Default | Semantic Type |
+--------+-----------------+------+------+---------+---------------+
| ts     | TimestampSecond | PRI  | NO   |         | TIMESTAMP     |
| action | String          |      | YES  |         | FIELD         |
+--------+-----------------+------+------+---------+---------------+
2 rows in set (0.02 sec)
```

Here are some examples of using `custom_time_index`, assuming the time field is named `input_ts`:

- 1742814853: `custom_time_index=input_ts;epoch;s`
- 1752749137000: `custom_time_index=input_ts;epoch;ms`
- "2025-07-17T10:00:00+0800": `custom_time_index=input_ts;datestr;%Y-%m-%dT%H:%M:%S%z`
- "2025-06-27T15:02:23.082253908Z": `custom_time_index=input_ts;datestr;%Y-%m-%dT%H:%M:%S%.9f%#z`

### Flatten JSON objects

To flatten a JSON object into a single-level structure, add the `x-greptime-pipeline-params` header to the request and set `flatten_json_object` to `true`.

Here is a sample request:

```shell
curl -X "POST" "http://localhost:4000/v1/ingest?db=<db-name>&table=<table-name>&pipeline_name=greptime_identity&version=<pipeline-version>" \
  -H "Content-Type: application/x-ndjson" \
  -H "Authorization: Basic {{authentication}}" \
  -H "x-greptime-pipeline-params: flatten_json_object=true" \
  -d "$<log-items>"
```

With this configuration, GreptimeDB will automatically flatten each field of the JSON object into a separate column. For example, the nested object

```JSON
{
  "a": {
    "b": {
      "c": [1, 2, 3]
    }
  },
  "d": [
    "foo",
    "bar"
  ],
  "e": {
    "f": [7, 8, 9],
    "g": {
      "h": 123,
      "i": "hello",
      "j": {
        "k": true
      }
    }
  }
}
```

will be flattened to:

```json
{
  "a.b.c": [1,2,3],
  "d": ["foo","bar"],
  "e.f": [7,8,9],
  "e.g.h": 123,
  "e.g.i": "hello",
  "e.g.j.k": true
}
```

docs/user-guide/logs/pipeline-config.md renamed to docs/reference/pipeline/pipeline-config.md

Lines changed: 6 additions & 4 deletions
@@ -51,10 +51,10 @@ The above plain text data will be converted to the following equivalent form:
 
 In other words, when the input is in plain text format, you need to use `message` to refer to the content of each line when writing `Processor` and `Transform` configurations.
 
-## Overall structure
+## Pipeline Configuration Structure
 
 Pipeline consists of four parts: Processors, Dispatcher, Transform, and Table suffix.
-Processors pre-processes input log data.
+Processors pre-process input log data.
 Dispatcher forwards pipeline execution context onto different subsequent pipeline.
 Transform decides the final datatype and table structure in the database.
 Table suffix allows storing the data into different tables.
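
As a rough sketch of how these parts fit together, the outline below shows one possible pipeline; the processor choices, field names, and patterns are illustrative assumptions, not content from this commit:

```yaml
# Illustrative pipeline sketch: field names and patterns are assumptions.
processors:
  - dissect:
      fields:
        - message
      patterns:
        - '%{ip} %{method} %{path} %{status} %{ts}'
  - date:
      fields:
        - ts
      formats:
        - "%Y-%m-%dT%H:%M:%S%z"

transform:
  - fields:
      - ip
      - method
      - path
    type: string
  - fields:
      - status
    type: uint32
  - fields:
      - ts
    type: time
    index: timestamp
```

The dispatcher and table suffix parts are optional and omitted from this sketch.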
@@ -827,6 +827,8 @@ Some notes regarding the `vrl` processor:
 2. The returning value of the vrl script should not contain any regex-type variables. They can be used in the script, but have to be `del`ed before returning.
 3. Due to type conversion between pipeline's value type and vrl's, the value type that comes out of the vrl script will be the ones with max capacity, meaning `i64`, `f64`, and `Timestamp::nanoseconds`.
 
+You can use `vrl` processor to set [table options](./write-log-api.md#set-table-options) while writing logs.
+
 ### `filter`
 
 The `filter` processor can filter out unneeded lines when the condition is meet.
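
To make the `vrl` notes in the hunk above more concrete, here is a hedged sketch of a `vrl` processor block; the field names and operations are assumptions for illustration only:

```yaml
# Illustrative only: derive a field, drop a temporary one, and return the event.
processors:
  - vrl:
      source: |
        # derive a new field from an existing one (field names are assumptions)
        .level_upper = upcase!(.level)
        # temporary fields can be removed before returning
        del(.debug_info)
        # the final `.` returns the modified event to the pipeline
        .
```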
@@ -1013,7 +1015,7 @@ Specify which field uses the inverted index. Refer to the [Transform Example](#t
 
 #### The Fulltext Index
 
-Specify which field will be used for full-text search using `index: fulltext`. This index greatly improves the performance of [log search](./query-logs.md). Refer to the [Transform Example](#transform-example) below for syntax.
+Specify which field will be used for full-text search using `index: fulltext`. This index greatly improves the performance of [log search](/user-guide/logs/fulltext-search.md). Refer to the [Transform Example](#transform-example) below for syntax.
 
 #### The Skipping Index
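
For context, the `index: fulltext` option changed in the hunk above sits on a transform entry; a minimal sketch (the `message` field name is an assumption) looks like this:

```yaml
# Illustrative only: enable the full-text index on a hypothetical `message` field.
transform:
  - fields:
      - message
    type: string
    index: fulltext
```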
@@ -1159,4 +1161,4 @@ table_suffix: _${type}
 These three lines of input log will be inserted into three tables:
 1. `persist_app_db`
 2. `persist_app_http`
-3. `persist_app`, for it doesn't have a `type` field, thus the default table name will be used.
+3. `persist_app`, for it doesn't have a `type` field, thus the default table name will be used.
