Skip to content

Sarif to ocsf #3

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Apr 16, 2025
Merged

Sarif to ocsf #3

merged 2 commits into from
Apr 16, 2025

Conversation

northdpole
Copy link
Contributor

@northdpole northdpole commented Feb 15, 2025

This PR updates the SARIF library with the capability of translating to OCSF.
The tests include output from 5 different tools from various vendors

@northdpole northdpole requested a review from mbirbeck February 21, 2025 11:22
@northdpole northdpole force-pushed the sarif-to-ocsf branch 4 times, most recently from 344b938 to d826a84 Compare February 23, 2025 16:44
Copy link

@mbirbeck mbirbeck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ve been looking into the way Go uses “omitempty” JSON tags, as I have concerns this may cause problems for automated generation of Go data structures and then their later downstream use. In this PR autogenerated Go structures are being created in the spec/gen/sarif-schema/ directory.

The official definition from the Go json documentation is:

“The "omitempty" option specifies that the field should be omitted from the encoding if the field has an empty value, defined as false, 0, a nil pointer, a nil interface value, and any array, slice, map, or string of length zero.”

The problem is that sometimes what is considered empty may actually still have value, it depends on the individual case, what the variable actually represents or how the variable is being used. So this is application specific, that’s not something a code generation tool is going to know about.

As far as I can tell the problems can be avoided if you never convert the Go structures into JSON. So never encode them. The autogenerated SARIF structs in this PR are being used as a place in memory to decode SARIF data into before transcoding to OCSF. So that should be OK, as long as those data structures are never used for anything else, so never written out as JSON anywhere.

The code below demonstrates the problem I am talking about. I've created a fictitious struct ProjectSecurityInfo, imagine this is part of the SARIF object model.

type ProjectSecurityInfo struct {
	ProjectOwner  string `json:"projectOwner,omitempty"`
	NumberOfScans int    `json:"numberOfScans,omitempty"`
	SafeToDeploy  bool   `json:"safeToDeploy,omitempty"`
}

func main() {
	p := ProjectSecurityInfo{ProjectOwner: "Matt", NumberOfScans: 5, SafeToDeploy: true}
	b, _ := json.Marshal(p)
	fmt.Printf("Struct 1: %+v\n", p)
	fmt.Printf("JSON encoded struct 1: %+v\n\n", string(b))

	p = ProjectSecurityInfo{ProjectOwner: "", NumberOfScans: 0, SafeToDeploy: false}
	b, _ = json.Marshal(p)
	fmt.Printf("Struct 2: %+v\n", p)
	fmt.Printf("JSON encoded struct 2: %+v", string(b))
}

The code above produces the following output:

Struct 1: {ProjectOwner:Matt NumberOfScans:5 SafeToDeploy:true}
JSON encoded struct 1: {"projectOwner":"Matt","numberOfScans":5,"safeToDeploy":true}

Struct 2: {ProjectOwner: NumberOfScans:0 SafeToDeploy:false}
JSON encoded struct 2: {}

You can see the JSON encoded data in structure 2 is completely empty. Imagine if that data structure was being written to a datastore in JSON format. The JSON field safeToDeploy having a value of false is probably important information that we want to record. Nothing would be recorded though. Or perhaps we want to update the numberOfSecurityScans field to zero and we send the ProjectSecurityInfo structure within a JSON HTTP payload somewhere. The information at the other end will not be updated. The sender might well get an 200 OK HTTP response back, and perhaps even the application log on the client side might record that the data safeToDeply=false was successfully sent to the destination if the data fields in the log were printed from the structure itself rather than the encoded JSON data. So this issue can cause subtle bugs.

The solution is to use pointers to the data types, so *string, *int, *bool within the structure, or in the case of structure itself, a *struct. The default zero value of a pointer in Go is nil so if the field is truly empty it will then also meet the definition of omitempty in the spec, and will be excluded from the JSON encoding process. So the SafeToDeploy variable will be present in the JSON data whether it is true or false, which is what we would want. There is a blog page that talks about this issue as the author encountered these problems when working on go-github, and also talks about how this was also a problem when creating the Go support for protocol buffers, the solution being autogenerated fields that have pointers.

So it seems to me that any Go code generation tool without domain specific knowledge which produces fields with an omitempty JSON tag should be generating a pointer for that field data type, as it is not going to be aware of how the generated code might be used. But when I look at the generated SARIF structures I can see some fields that don’t meet that criteria. For example I can see a generated structure type called Invocation that has a bool field called ExecutionSuccessful, and that also has an omitempty JSON tag. So it looks to me that the autogeneration tool is producing code that could lead to subtle bugs in the future if the SARIF structures are encoded to JSON.

@andream16
Copy link
Contributor

Hey

Thank you so much for the detailed comment and the playground link!

You are highlighting an interesting issue to do with Go and its tags and can make things tricky sometimes.


Can you help me understanding a scenario that impacts you on consuming information produced when serialising information with these bindings?

Feel free to reply to me on Discord if some details are not OK for a public GH repo 😄

Curious to learn more 😃


Usually in go, omitempty is used for optional fields - fields that are not required by downstream services. Usually these fields are implemented with omitempty + pointer, for example, in this case only threadFlows is required and will be serialised with 0 value no matter what, even with an empty instance of CodeFlow. On the other end, message and properties will be skipped if not initialised, as they are empty. The pointer still helps in understanding if they are set!

type CodeFlow struct {
    Message *Message `json:"message,omitempty"`
    Properties *PropertyBag `json:"properties,omitempty"`
    ThreadFlows []ThreadFlow `json:"threadFlows"`
}

Link to your amended snippet :) https://go.dev/play/p/6MU7D_cdCeB


go-jsonschema uses omitempty + pointers for optional fields only, following the common guidelines.

For example, given the following schema snippet, only threadFlows is required and it's referenced as value without an omitempty tag:

"codeFlow": {
      "description": "A set of threadFlows which together describe a pattern of code execution relevant to detecting a result.",
      "additionalProperties": false,
      "type": "object",
      "properties": {
        "message": {
          "description": "A message relevant to the code flow.",
          "$ref": "#/$defs/message"
        },
        "threadFlows": {
          "description": "An array of one or more unique threadFlow objects, each of which describes the progress of a program through a thread of execution.",
          "type": "array",
          "minItems": 1,
          "uniqueItems": false,
          "items": {
            "$ref": "#/$defs/threadFlow"
          }
        },
        "properties": {
          "description": "Key/value pairs that provide additional information about the code flow.",
          "$ref": "#/$defs/propertyBag"
        }
      },
      "required": [
        "threadFlows"
      ]
    },

This makes sure that required fields are always present and optional fields can be leveraged and checked for them to be present. This distinction is also used behind the scenes for validation (required vs optional) which is quite powerful!


To recap, I understand your concerns but we're working on the basis of always propagating and being able to work with the required fields, so you shouldn't have issues downstream - let me know if this clarifies the concerns!

@mbirbeck
Copy link

mbirbeck commented Feb 25, 2025

Hi, responding to your question “Can you help me understanding a scenario that impacts you on consuming information produced when serialising information with these bindings?” The answer to that is no, as there isn’t a particular scenario impacting me. The points I was making are just general ones about the problems of using the omitempty tags in Go and the fact that if you don’t account for how these work when autogenerating code it can introduce bugs later on. This would apply to any code for any project. I created a fake scenario in my original post to try and help explain the sort of things that might go wrong.

When you say that usually omitempty is used for fields that are not required by downstream services are you saying that we are not concerned about any of the fields in the SARIF spec that are not listed as required?

Perhaps I am missing something, but I also don’t understand what your code snippets and playground link are demonstrating, perhaps you could clarify the points? It sounds like you might be agreeing with me when you mention fields with omitempty should have a pointer. But the problem I am highlighting is that there are autogenerated fields with omitempty that don’t have a pointer.

Finally your last point sounds again like you might be saying we are not concerned about fields if they are not required, but the vast majority of the SARIF spec fields are not required. The required fields for each object simply represent the bare minimum that needs to be present for the data to be considered valid SARIF. If we are just interested in the required fields, then why have the optional ones at all? That would fix the problem actually, as If they are not there then there is no chance for them to go wrong :)

As I mentioned before, in the use case of transcoding from SARIF to OCSF there is not a problem as the SARIF Go structures are not encoded into JSON, I am concerned though that once the data structures are created, they will likely be used for something else. Perhaps they could be put inside an internal directory to indicate they are not for general use. The main point I was trying to make was that having an automatic code generation process which is not taking into account the way omitempty works does leave the door open to the possibility of a problem occurring, whereas if the code generation process did take into account the way omitempty works, then it would rule out the possibility.

@northdpole northdpole force-pushed the sarif-to-ocsf branch 3 times, most recently from 2a90bfc to 64cc142 Compare April 9, 2025 15:46
@andream16 andream16 merged commit 3920208 into main Apr 16, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants