reuse Sarif base parser for snyk and mayhem parsers #12788

valentijnscholten · 2025-07-15T21:46:57Z

Over time two Sarif based parsers were added because they needed some custom parser logic to fully parse their scan reports:

Mayhem
Snyk Code

This resulted in some code duplication. This PR changes the Sarif base parser to be extensible and uses this to simplify the Snyk Code and Mayhem parsers. I didn't go as far as to create extension hooks for each an every field. This PR now has some examples that can be used in future PRs if more fields need to be extendible.

Initially I had the CWE parsing logic for the Mayhem taxa based CWEs also in an extenstion hook. But then I found out that the "taxa" format is actually somewhat official Sarif formatting. Also there was already code in the base Sarif parser looking for CWEs in various places. So the PR adds the "taxa way" to the base Sarif parser. This way it's also beneficial for any other non-Mayem report that might use this format.

Additional things in this PR:

Reduce code duplication between the snyk parser and snyk_code parser.
Enhance test cases for the snyk_code parser

[sc-11544]

dryrunsecurity · 2025-07-15T22:19:30Z

This pull request contains a potential Denial of Service (DoS) vulnerability in the SARIF parser where an unhandled ValueError could cause the file import process to crash if a SARIF file lacks necessary title generation fields.

Unhandled Exception leading to DoS in dojo/tools/sarif/parser.py

Vulnerability	Unhandled Exception leading to DoS
Description	The `get_finding_title` method in `dojo/tools/sarif/parser.py` explicitly raises a `ValueError` if it cannot extract a title from the SARIF result or rule. The calling method, `get_items_from_result`, does not contain a `try...except` block to catch this specific `ValueError` or a general exception. This means that if a SARIF file is provided that lacks the necessary fields for title generation (e.g., `message`, `shortDescription`, `fullDescription`, `name`, or `id` in the rule), the `ValueError` will propagate up the call stack, likely causing the SARIF parser to crash and fail the file import process.

django-DefectDojo/dojo/tools/sarif/parser.py

Lines 143 to 389 in 3ae607a

    
                   # if the data is here we try to convert it to datetime 
        
                   return dateutil.parser.isoparse(raw_date) 
        
               # Extension points for subclasses 
        
               def get_items_from_result(self, result, rules, artifacts, run_date): 
        
                   """ 
        
                   Main method to extract findings from a SARIF result. 
        
                   This method can be overridden by subclasses for custom behavior. 
        
                   """ 
        
                   # see 
        
                   # https://docs.oasis-open.org/sarif/sarif/v2.1.0/csprd01/sarif-v2.1.0-csprd01.html 
        
                   # / 3.27.9 
        
                   kind = result.get("kind", "fail") 
        
                   if kind != "fail": 
        
                       return None 
        
                   # if finding is suppressed, mark it as False Positive 
        
                   # Note: see 
        
                   # https://docs.oasis-open.org/sarif/sarif/v2.0/csprd02/sarif-v2.0-csprd02.html#_Toc10127852 
        
                   suppressed = False 
        
                   if result.get("suppressions"): 
        
                       suppressed = True 
        
                   # if there is a location get all files into files list 
        
                   files = [] 
        
                   if "locations" in result: 
        
                       for location in result["locations"]: 
        
                           file_path = None 
        
                           line = None 
        
                           if "physicalLocation" in location: 
        
                               file_path = location["physicalLocation"]["artifactLocation"]["uri"] 
        
                               # 'region' attribute is optionnal 
        
                               if "region" in location["physicalLocation"]: 
        
                                   # https://docs.oasis-open.org/sarif/sarif/v2.0/csprd02/sarif-v2.0-csprd02.html / 3.30.1 
        
                                   # need to check whether it is byteOffset 
        
                                   if "byteOffset" in location["physicalLocation"]["region"]: 
        
                                       pass 
        
                                   else: 
        
                                       line = location["physicalLocation"]["region"]["startLine"] 
        
                           files.append((file_path, line, location)) 
        
                   if not files: 
        
                       files.append((None, None, None)) 
        
                   result_items = [] 
        
                   for file_path, line, location in files: 
        
                       # test rule link 
        
                       rule = rules.get(result.get("ruleId")) 
        
                       # Get description from parser (uses inheritance) 
        
                       description = self.get_finding_description(result, rule, location) 
        
                       # Get title from parser (uses inheritance) 
        
                       title = self.get_finding_title(result, rule, location) 
        
                       # Get finding type (uses inheritance) 
        
                       static_finding, dynamic_finding = self.get_finding_type() 
        
                       finding = Finding( 
        
                           title=title, 
        
                           severity=get_severity(result, rule), 
        
                           description=description, 
        
                           static_finding=static_finding, 
        
                           dynamic_finding=dynamic_finding, 
        
                           false_p=suppressed, 
        
                           active=not suppressed, 
        
                           file_path=file_path, 
        
                           line=line, 
        
                           references=get_references(rule), 
        
                       ) 
        
                       if "ruleId" in result: 
        
                           finding.vuln_id_from_tool = result["ruleId"] 
        
                           # for now we only support when the id of the rule is a CVE 
        
                           if cve_try(result["ruleId"]): 
        
                               finding.unsaved_vulnerability_ids = [cve_try(result["ruleId"])] 
        
                       # some time the rule id is here but the tool doesn't define it 
        
                       if rule is not None: 
        
                           cwes_extracted = get_rule_cwes(rule) 
        
                           if len(cwes_extracted) > 0: 
        
                               finding.cwe = cwes_extracted[-1] 
        
                           # Some tools such as GitHub or Grype return the severity in properties 
        
                           # instead 
        
                           if "properties" in rule and "security-severity" in rule["properties"]: 
        
                               try: 
        
                                   cvss = float(rule["properties"]["security-severity"]) 
        
                                   severity = cvss_to_severity(cvss) 
        
                                   finding.cvssv3_score = cvss 
        
                                   finding.severity = severity 
        
                               except ValueError: 
        
                                   if rule["properties"]["security-severity"].lower().capitalize() in {"Info", "Low", "Medium", "High", "Critical"}: 
        
                                       finding.severity = rule["properties"]["security-severity"].lower().capitalize() 
        
                                   else: 
        
                                       finding.severity = "Info" 
        
                       # manage the case that some tools produce CWE as properties of the result 
        
                       cwes_properties_extracted = get_result_cwes_properties(result) 
        
                       if len(cwes_properties_extracted) > 0: 
        
                           finding.cwe = cwes_properties_extracted[-1] 
        
                       # manage the case that some tools produce CWE using taxa (official SARIF approach) 
        
                       cwes_taxa_extracted = get_result_cwes_taxa(result) 
        
                       if len(cwes_taxa_extracted) > 0: 
        
                           finding.cwe = cwes_taxa_extracted[-1] 
        
                       # Get custom CWEs if available (uses inheritance) 
        
                       custom_cwes = self.get_finding_cwes(result) 
        
                       if custom_cwes: 
        
                           finding.cwe = custom_cwes[-1]  # Use the last CWE like other logic 
        
                       # manage fixes provided in the report 
        
                       if "fixes" in result: 
        
                           finding.mitigation = "\n".join( 
        
                               [fix.get("description", {}).get("text") for fix in result["fixes"]], 
        
                           ) 
        
                       if run_date: 
        
                           finding.date = run_date 
        
                       # manage tags provided in the report and rule and remove duplicated 
        
                       tags = list(set(get_properties_tags(rule) + get_properties_tags(result))) 
        
                       tags = [s.removeprefix("external/cwe/") for s in tags] 
        
                       finding.tags = tags 
        
                       # manage fingerprints 
        
                       # fingerprinting in SARIF is more complete than in current implementation 
        
                       # SARIF standard make it possible to have multiple version in the same report 
        
                       # for now we just take the first one and keep the format to be able to 
        
                       # compare it 
        
                       if result.get("fingerprints"): 
        
                           hashes = get_fingerprints_hashes(result["fingerprints"]) 
        
                           first_item = next(iter(hashes.items())) 
        
                           finding.unique_id_from_tool = first_item[1]["value"] 
        
                       elif result.get("partialFingerprints"): 
        
                           # for this one we keep an order to have id that could be compared 
        
                           hashes = get_fingerprints_hashes(result["partialFingerprints"]) 
        
                           sorted_hashes = sorted(hashes.keys()) 
        
                           finding.unique_id_from_tool = "|".join( 
        
                               [f'{key}:{hashes[key]["value"]}' for key in sorted_hashes], 
        
                           ) 
        
                       # Allow subclasses to customize the finding (uses inheritance) 
        
                       self.customize_finding(finding, result, rule, location) 
        
                       result_items.append(finding) 
        
                   return result_items 
        
               def get_finding_cwes(self, result): 
        
                   """ 
        
                   Hook method for subclasses to extract custom CWE values from result. 
        
                   Override this method to add custom CWE extraction logic. 
        
                   """ 
        
                   return [] 
        
               def get_finding_title(self, result, rule, location): 
        
                   """ 
        
                   Get title for the finding. Subclasses can override this method 
        
                   to add custom title formatting. Use super() to get the base title. 
        
                   """ 
        
                   title = None 
        
                   if "message" in result: 
        
                       title = get_message_from_multiformatMessageString( 
        
                           result["message"], rule, 
        
                       ) 
        
                   if title is None and rule is not None: 
        
                       if "shortDescription" in rule: 
        
                           title = get_message_from_multiformatMessageString( 
        
                               rule["shortDescription"], rule, 
        
                           ) 
        
                       elif "fullDescription" in rule: 
        
                           title = get_message_from_multiformatMessageString( 
        
                               rule["fullDescription"], rule, 
        
                           ) 
        
                       elif "name" in rule: 
        
                           title = rule["name"] 
        
                       elif "id" in rule: 
        
                           title = rule["id"] 
        
                   if title is None: 
        
                       msg = "No information found to create a title" 
        
                       raise ValueError(msg) 
        
                   return textwrap.shorten(title, 150) 
        
               def get_finding_description(self, result, rule, location): 
        
                   """ 
        
                   Get description for the finding. Subclasses can override this method 
        
                   to add custom description formatting. Use super() to get the base description. 
        
                   """ 
        
                   description = "" 
        
                   message = "" 
        
                   if "message" in result: 
        
                       message = get_message_from_multiformatMessageString( 
        
                           result["message"], rule, 
        
                       ) 
        
                       description += f"**Result message:** {message}\n" 
        
                   if get_snippet(location) is not None: 
        
                       description += f"**Snippet:**\n```\n{get_snippet(location)}\n```\n" 
        
                   if rule is not None: 
        
                       if "name" in rule: 
        
                           description += f"**{_('Rule name')}:** {rule.get('name')}\n" 
        
                       shortDescription = "" 
        
                       if "shortDescription" in rule: 
        
                           shortDescription = get_message_from_multiformatMessageString( 
        
                               rule["shortDescription"], rule, 
        
                           ) 
        
                           if shortDescription != message: 
        
                               description += f"**{_('Rule short description')}:** {shortDescription}\n" 
        
                       if "fullDescription" in rule: 
        
                           fullDescription = get_message_from_multiformatMessageString( 
        
                               rule["fullDescription"], rule, 
        
                           ) 
        
                           if ( 
        
                               fullDescription != message 
        
                               and fullDescription != shortDescription 
        
                           ): 
        
                               description += f"**{_('Rule full description')}:** {fullDescription}\n" 
        
                   if len(result.get("codeFlows", [])) > 0: 
        
                       description += get_codeFlowsDescription(result["codeFlows"]) 
        
                   return description.removesuffix("\n") 
        
               def get_finding_type(self): 
        
                   """ 
        
                   Hook method for subclasses to specify finding type. 
        
                   Returns tuple of (static_finding, dynamic_finding). 
        
                   """ 
        
                   return (True, False)  # SARIF is static by definition 
        
               def customize_finding(self, finding, result, rule, location): 
        
                   """ 
        
                   Hook method for subclasses to customize the finding after creation. 
        
                   Override this method to add custom fields or modify the finding. 
        
                   """ 
        
           def get_rules(run): 
        
               rules = {}

All finding details can be found in the DryRun Security Dashboard.

paulOsinski · 2025-07-16T18:28:06Z

looks good to me / does what I need it to, anyway!

mtesauro

Approved

valentijnscholten added this to the 2.49.0 milestone Jul 15, 2025

github-actions bot added docs unittests parser labels Jul 15, 2025

valentijnscholten marked this pull request as ready for review July 15, 2025 22:18

valentijnscholten requested review from Maffooch and mtesauro as code owners July 15, 2025 22:18

mtesauro approved these changes Jul 18, 2025

View reviewed changes

Maffooch approved these changes Jul 21, 2025

View reviewed changes

Maffooch requested review from dogboat and blakeaowens July 21, 2025 15:20

dogboat approved these changes Jul 21, 2025

View reviewed changes

valentijnscholten added 2 commits July 21, 2025 21:03

reuse Sarif base parser for snyk and mayhem

66ad5a2

simplify

3ae607a

valentijnscholten force-pushed the sarif-extensibility branch from 570400f to 3ae607a Compare July 21, 2025 19:03

blakeaowens approved these changes Jul 23, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

reuse Sarif base parser for snyk and mayhem parsers #12788

reuse Sarif base parser for snyk and mayhem parsers #12788

Uh oh!

valentijnscholten commented Jul 15, 2025 •

edited

Loading

Uh oh!

dryrunsecurity bot commented Jul 15, 2025 •

edited

Loading

Uh oh!

paulOsinski commented Jul 16, 2025 •

edited

Loading

Uh oh!

mtesauro left a comment

Uh oh!

Uh oh!

reuse Sarif base parser for snyk and mayhem parsers #12788

Are you sure you want to change the base?

reuse Sarif base parser for snyk and mayhem parsers #12788

Uh oh!

Conversation

valentijnscholten commented Jul 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dryrunsecurity bot commented Jul 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

paulOsinski commented Jul 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mtesauro left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

valentijnscholten commented Jul 15, 2025 •

edited

Loading

dryrunsecurity bot commented Jul 15, 2025 •

edited

Loading

paulOsinski commented Jul 16, 2025 •

edited

Loading