RFC: PCIe LMT Output Format for OCP Integration #5

sksekar · 2023-09-20T14:29:11Z

sksekar
Sep 20, 2023
Maintainer

Objective

Add support for emitting test results compliant with OCP Test and Validation Output Specification from the PCIe LMT Diagnostic tool.

Background

A sample LMT test run performs:

For each LMT group listed in the platform configuration:
- For each sampling step provided in the configuration:
  - For each PCIe device listed in the LMT group:
    - Report Lane Margining capabilities for the given PCIe device’s receiver
    - For each lane present in the PCIe device’s receiver:
      - Perform Lane Margining tests
      - Report Lane Margining results
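
The loop structure above can be sketched roughly as follows. This is only an illustration of the visiting order: the `CONFIG` layout and the `walk` helper are hypothetical and do not reflect the tool's real configuration schema.

```python
# Hypothetical, simplified stand-in for the platform configuration.
CONFIG = {
    "groups": [
        {
            "name": "group0",
            "steps": [1, 2],
            "devices": [{"bdf": "0000:1a:00.0", "lanes": 2}],
        }
    ]
}


def walk(config):
    """Yield (group, step, bdf, lane) in the order a run would visit them."""
    for group in config["groups"]:
        for step in group["steps"]:
            for device in group["devices"]:
                # Capabilities would be reported here, once per device.
                for lane in range(device["lanes"]):
                    # Margining would be performed and reported here.
                    yield group["name"], step, device["bdf"], lane


visits = list(walk(CONFIG))
for visit in visits:
    print(visit)
```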

Current Output Format

CSV format

$ pci_lmt -o csv  [config_file](https://github.com/opencomputeproject/ocp-diag-pci_lmt/blob/main/configs/test_config.json)

test_info.run_id,test_info.timestamp,test_info.host_id,test_info.hostname,test_info.model_name,test_info.dwell_time_secs,test_info.elapsed_time_secs,test_info.error_count_limit,test_info.test_version,test_info.annotation,device_info.bdf,device_info.speed,device_info.width,device_info.lmt_capable,device_info.ind_error_sampler,device_info.sample_reporting_method,device_info.ind_left_right_timing,device_info.ind_up_down_voltage,device_info.voltage_supported,device_info.num_voltage_steps,device_info.num_timing_steps,device_info.max_timing_offset,device_info.max_voltage_offset,device_info.sampling_rate_voltage,device_info.sampling_rate_timing,device_info.max_lanes,device_info.reserved,lane,receiver_number,margin_type,step,sample_count,sample_count_bits,error_count,ber,error,error_msg
179366012392285637575280230165408925566,1694625794,319388925,rtptest6597.atn2,DELTALAKE_CPL_T17_FREYA_CWC,5,7.282705307006836,63,1.2.0,ASIC:PEX_SW:USP,0000:1a:00.0,16GT/s,8,True,1,0,1,1,1,127,31,50,50,0,0,7,0,0,6,timing_right,1,78,67108864,0,0.0,False,
[..]

JSON format

$ pci_lmt -o json  [config_file](https://github.com/opencomputeproject/ocp-diag-pci_lmt/blob/main/configs/test_config.json)

{"test_info": {"run_id": "1475535182966550306115162635493469189705", "timestamp": 1694625752, "host_id": "319388925", "hostname": "rtptest6597.atn2", "model_name": "DELTALAKE_CPL_T17_FREYA_CWC", "dwell_time_secs": 5, "elapsed_time_secs": 7.304824590682983, "error_count_limit": 63, "test_version": "1.2.0", "annotation": "ASIC:PEX_SW:USP"}, "device_info": {"bdf": "0000:1a:00.0", "speed": "16GT/s", "width": 8, "lmt_capable": true, "ind_error_sampler": 1, "sample_reporting_method": 0, "ind_left_right_timing": 1, "ind_up_down_voltage": 1, "voltage_supported": 1, "num_voltage_steps": 127, "num_timing_steps": 31, "max_timing_offset": 50, "max_voltage_offset": 50, "sampling_rate_voltage": 0, "sampling_rate_timing": 0, "max_lanes": 7, "reserved": 0}, "lane": 0, "receiver_number": 6, "margin_type": "timing_right", "step": 1, "sample_count": 78, "sample_count_bits": 67108864, "error_count": 0, "ber": 0.0, "error": false, "error_msg": ""}
[...]

Proposal

The current proposal is to use:

testRunArtifact to indicate the start and stop of the test, including the PASS/FAIL verdict.
testStepArtifact for each LMT step in each LMT group (Rx ID with margin direction).
measurementSeries for each BDF, with validators.
measurementSeriesElement for each lane, carrying the actual bit-error count.

Sample Execution

pci_lmt -o ocp config_file

{"schemaVersion": {"major": 2, "minor": 0}, "sequenceNumber": 0, "timestamp": "2023-09-20T03:15:16.566543+01:00"}

{"testRunArtifact": {"testRunStart": {"name": "pci_lmt", "version": "1.2.0", "commandLine": "pci_lmt -o json config_file", "parameters": {}, "dutInfo": {"dutInfoId": "319388925", "name": "host1234.test", "platformInfos": [], "softwareInfos": [], "hardwareInfos": []}}}, "sequenceNumber": 1, "timestamp": "2023-09-20T03:15:16.566571+01:00"}

{"testStepArtifact": {"testStepId": "0", "testStepStart": {"name": "DUMMY_CARRIER_CARD_ASIC_RX6_TIMING_LEFT_STEP1"}}, "sequenceNumber": 2, "timestamp": "2023-09-20T03:15:16.566645+01:00"}
{"testStepArtifact": {"testStepId": "0", "measurementSeriesStart": {"name": "BDF_0000:1a:00.0", "unit": "bit_error_count", "measurementSeriesId": "0000:1a:00.0", "validators": [{"name": "eq_0", "type": "EQUAL", "value": 0}]}}, "sequenceNumber": 3, "timestamp": "2023-09-20T03:15:16.566729+01:00"}
{"testStepArtifact": {"testStepId": "0", "measurementSeriesStart": {"name": "BDF_0000:1b:00.0", "unit": "bit_error_count", "measurementSeriesId": "0000:1b:00.0", "validators": [{"name": "eq_0", "type": "EQUAL", "value": 0}]}}, "sequenceNumber": 4, "timestamp": "2023-09-20T03:15:16.566729+01:00"}
{"testStepArtifact": {"testStepId": "0", "measurementSeriesElement": {"index": 0, "value": 0, "timestamp": "2023-09-20T03:15:16.566784+01:00", "measurementSeriesId": "0000:1a:00.0"}}, "sequenceNumber": 5, "timestamp": "2023-09-20T03:15:16.566824+01:00"}
{"testStepArtifact": {"testStepId": "0", "measurementSeriesElement": {"index": 1, "value": 0, "timestamp": "2023-09-20T03:15:16.566870+01:00", "measurementSeriesId": "0000:1a:00.0"}}, "sequenceNumber": 6, "timestamp": "2023-09-20T03:15:16.566907+01:00"}
:
<for all lanes>
:
{"testStepArtifact": {"testStepId": "0", "measurementSeriesElement": {"index": 0, "value": 0, "timestamp": "2023-09-20T03:15:16.566784+01:00", "measurementSeriesId": "0000:1b:00.0"}}, "sequenceNumber": 7, "timestamp": "2023-09-20T03:15:16.566824+01:00"}
{"testStepArtifact": {"testStepId": "0", "measurementSeriesElement": {"index": 1, "value": 0, "timestamp": "2023-09-20T03:15:16.566870+01:00", "measurementSeriesId": "0000:1b:00.0"}}, "sequenceNumber": 8, "timestamp": "2023-09-20T03:15:16.566907+01:00"}
:
<for all lanes>
:
{"testStepArtifact": {"testStepId": "0", "measurementSeriesEnd": {"measurementSeriesId": "0000:1a:00.0", "totalCount": 2}}, "sequenceNumber": 9, "timestamp": "2023-09-20T03:15:16.567049+01:00"}
{"testStepArtifact": {"testStepId": "0", "measurementSeriesEnd": {"measurementSeriesId": "0000:1b:00.0", "totalCount": 2}}, "sequenceNumber": 10, "timestamp": "2023-09-20T03:15:16.567049+01:00"}
{"testStepArtifact": {"testStepId": "0", "testStepEnd": {"status": "COMPLETE"}}, "sequenceNumber": 11, "timestamp": "2023-09-20T03:15:16.567093+01:00"}
:
<for all steps & groups>
:
{"testRunArtifact": {"testRunEnd": {"status": "COMPLETE", "result": "FAIL"}}, "sequenceNumber": 22, "timestamp": "2023-09-20T03:15:16.568381+01:00"}
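
For illustration, a stream shaped like the sample above could be produced by a small writer such as the one below. This is a sketch under stated assumptions, not the tool's implementation: `ArtifactWriter` and `run` are hypothetical names, per-element timestamps are omitted, and the verdict rule (FAIL when any lane violates the `eq_0` validator) is assumed rather than taken from the spec.

```python
import json
from datetime import datetime, timezone


class ArtifactWriter:
    """Hypothetical helper that stamps each artifact with sequenceNumber and timestamp."""

    def __init__(self):
        self.seq = 0
        self.lines = []

    def emit(self, artifact):
        record = dict(artifact)
        record["sequenceNumber"] = self.seq
        record["timestamp"] = datetime.now(timezone.utc).isoformat()
        self.seq += 1
        self.lines.append(json.dumps(record))
        return record

    def step(self, step_id, payload):
        return self.emit({"testStepArtifact": {"testStepId": step_id, **payload}})


def run(lane_errors):
    """Emit one step with one series; lane_errors holds a bit error count per lane."""
    w = ArtifactWriter()
    w.emit({"schemaVersion": {"major": 2, "minor": 0}})
    w.emit({"testRunArtifact": {"testRunStart": {"name": "pci_lmt", "version": "1.2.0"}}})
    w.step("0", {"testStepStart": {"name": "DUMMY_CARRIER_CARD_ASIC_RX6_TIMING_LEFT_STEP1"}})
    bdf = "0000:1a:00.0"
    w.step("0", {"measurementSeriesStart": {
        "name": "BDF_" + bdf, "unit": "bit_error_count", "measurementSeriesId": bdf,
        "validators": [{"name": "eq_0", "type": "EQUAL", "value": 0}]}})
    for lane, errors in enumerate(lane_errors):
        w.step("0", {"measurementSeriesElement": {
            "index": lane, "value": errors, "measurementSeriesId": bdf}})
    w.step("0", {"measurementSeriesEnd": {
        "measurementSeriesId": bdf, "totalCount": len(lane_errors)}})
    w.step("0", {"testStepEnd": {"status": "COMPLETE"}})
    # Assumed verdict rule: FAIL if any lane violates the eq_0 validator.
    result = "PASS" if all(e == 0 for e in lane_errors) else "FAIL"
    w.emit({"testRunArtifact": {"testRunEnd": {"status": "COMPLETE", "result": result}}})
    return w.lines


lines = run([0, 3])
print("\n".join(lines))
```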

To Be Discussed

How do we expose device capabilities such as PCIe speed, width, independent samplers, and the number of voltage/timing steps? Can they be part of the HardwareInfos?

makestuffwork · 2023-09-23T20:17:17Z

makestuffwork
Sep 23, 2023

I'm still getting familiarized with the JSON spec. Here's how I'm making sense of the proposal:

I conceptualize the LMT hierarchies as System::Link::TorV::Port::Lane::Offset. The testStep corresponds to the Port scope, because a Port has one common test spec, and Ports are tested sequentially.
What's not clear to me is how to associate the testStepId with the higher scope info, such as the BDF. In the example above, everything is lumped into the testStepStart.name. Can we do better than that?

On the other end of the hierarchy, the lowest level of an LMR measurement has the following raw data: step (with +/- direction), status (response payload[7:6]), error count, sample count. That doesn't align with the MeasurementSeriesElement vs. the validator scheme. I see why we have to interpret it into a "bit errors" measurement. Do we plan to keep the raw data in this output?

Somehow, I feel it's more appropriate to assign the lane number, instead of the BDF, to the measurementSeriesId. Then each measurementSeriesElement can represent an offset.

Overall, the LMT outputs a lot more details that we need to tuck away in the extension fields. The LMR reported capability parameters may go into the MeasurementSeriesStart.Metadata. We should standardize the format for that.

0 replies

sksekar · 2023-10-11T14:24:50Z

sksekar
Oct 11, 2023
Maintainer Author

Summarizing the proposal provided by Dan via email:

Individual PCIe endpoints each become a hardwareInfo object.
Bus level measurements can be used to report link level information (speed, width, LMT capabilities) linked to a PCIe endpoint.
Lane level measurements can be used to report LMT results (BER, bit counts, etc.) linked to a bus level sub-component within a PCIe endpoint.

// Example hardwareInfo for each PCIe device
{
  "hardwareInfoId": 1,
  "computerSystem": "hostname_goes_here",
  "manager": "", // leave blank
  "name": "0000:1a:00.0", // name of the PCIe device
  "location": "0", // a few options: leave it blank, or include the lane number (or name + lane number)
  "odataId": "", // blank
  "partNumber": "", // blank
  "serialNumber": "", // blank
  "manufacturer": "", // blank
  "manufacturerPartNumber": "", // blank
  "partType": "",
  "version": "",
  "revision": ""
}

// Bus level measurement:
{
  "name": "pcie_speed",
  "unit": "GT/s",
  "hardwareInfoId": 1,
  "validators": [],
  "value": 16
}

// Lane level measurement:
{
  "name": "error_count",
  "unit": "",
  "hardwareInfoId": 1,
  "subcomponent": {
    "type": "bus",
    "name": "lane_0",
    "version": "",
    "revision": ""
  },
  "validators": [],
  "value": 0
}

Summarizing the discussion from the 10/5 meeting with Dan (Google), Hua (Google), Francesco (Meta), Leland (Meta), Adrian (Meta), Sathish (Meta):

Input parameters should consist of:

Set of links to be margined
Margin type (Voltage/Timing)
Margin direction (Up/Down or Right/Left)
Receiver
Dwell time
Target error-count or BER (for pass/fail determination)

HardwareInfo should ideally point to a link and not a single component

This is specifically required for systems with re-timers, which don't have a BDF but are still considered endpoints from the LMT point of view.
Link can be specified by a pair of BDFs.

SubComponentInfo can be used to specify the receiver, lane and the margining info:

For Voltage margining: Rx<#>:Lane<#>:+/-<#>mV
For Timing margining: Rx<#>:Lane<#>:+/-<#>UI
Conversion of step count to mV or UI% needs to be done by the tool

Measurements consist of the error count encountered on a specific receiver and lane while performing the given margining type.
Can also use metadata to specify more test specific information if needed.
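
As a rough sketch of the step-to-offset conversion mentioned above, the helpers below scale a step count linearly against the reported maximum offset and build the `Rx<#>:Lane<#>:+/-<#>` string. The scaling is an assumption that should be checked against the PCIe Base Specification: it treats MaxTimingOffset as a percentage of a nominal UI and MaxVoltageOffset as a percentage of one volt. The function names are hypothetical.

```python
def timing_offset_ui(step, num_timing_steps, max_timing_offset):
    """Timing offset in UI, assuming max_timing_offset is a percentage of one UI."""
    return step / num_timing_steps * max_timing_offset / 100.0


def voltage_offset_mv(step, num_voltage_steps, max_voltage_offset):
    """Voltage offset in mV, assuming max_voltage_offset is a percentage of 1 V."""
    return step / num_voltage_steps * max_voltage_offset * 10.0


def subcomponent_name(rx, lane, offset, unit):
    """Format the margin point as Rx<#>:Lane<#>:<signed offset><unit>."""
    return f"Rx{rx}:Lane{lane}:{offset:+g}{unit}"


# Capability values taken from the sample device_info earlier in the thread.
full_timing = timing_offset_ui(31, num_timing_steps=31, max_timing_offset=50)
full_voltage = voltage_offset_mv(127, num_voltage_steps=127, max_voltage_offset=50)
print(subcomponent_name(6, 0, full_timing, "UI"))
print(subcomponent_name(6, 0, -full_voltage, "mV"))
```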

// Example hardwareInfo for each PCIe link
{
  "hardwareInfoId": 1,
  "computerSystem": "hostname",
  "name": "0000:1a:00.0_0000:1b:00.0", // name of the PCIe link
  "location": "Motherboard" // tray information
}

// Example lane level measurement for voltage margining:
{
  "name": "error_count",
  "hardwareInfoId": 1,
  "subcomponent": {
    "type": "voltage_margining",
    "name": "rx0:lane0:+1mv"
  },
  "value": 0
}

Next steps:

Finalize the output format mapping and begin implementation.
Determine whether mapping needs to be done natively (within binary) or in a different layer (wrapper).

0 replies

makestuffwork · 2024-02-25T03:31:45Z

makestuffwork
Feb 25, 2024

@sksekar, Adrian and I had a 2.5hr meeting on 2024-02-23. Here are my takeaways from the meeting:

The spec we are defining should help the consumer of the diag output interpret, process, and present the test result. The diags themselves may also play a role in processing and presenting the readings from the hardware. The spec should support a variety of LMT diags and use cases. Therefore, it needs to be as extensible and unrestrictive as possible.

Our approach is to each come up with a spec proposal that suits our existing diags: https://github.com/opencomputeproject/ocp-diag-pci_lmt and https://github.com/google/pcie_lmt. We then resolve any conflict between them and find common ground.

We have agreed on a hierarchy of the LMT subjects (LMT is the diag. LMR is the PCI-SIG-specified PCIe feature. LMT conducts LMR.):

System: The LMR has to run from an enumerated PCIe topology according to the PCI-SIG spec. We can call it the DUT.
- Link: This is a well-established PCIe concept. It can be identified by a bus number in the DUT PCIe enumeration.
  - RxLane: This is a single receiver target that can map to a single SerDes hardware.
    - MarginPoint: This is the subject of one LMR step margin operation. It involves a timing or voltage offset.

At the very basic level, an LMR step margin operation returns three readings from the hardware: status, error_count, sample_count. We should start with mapping those to the output spec. This enables raw measurement collection.

Here are a few rules of thumb I learned about mapping info to the output spec:

TestStep should only be used for sequential operations.
Subcomponent loosely maps to hardware. So, it can be used for parallel entities.
A Measurement should not be completely determined by the test settings, though it's a convenient construct for reporting some interpretation of the test settings.
The "name" field indicates to the interpreter how to parse the info block. It should not require parsing itself. In my opinion, it should be independent rather than contextual.

Based on the above takeaways from the meeting, here's my proposal. Let's first review the ideas. The details need to be refined.

An LMT subject is mapped to a subcomponent. Here's an example:

      "subcomponent": {
        "name": "PCIELMT-MARGINPOINT-PCI",
        "location": "BUS:a1h--RX:6--LN:7--T:-0.15UI",
      },

The name PCIELMT-MARGINPOINT-PCI has three parts. PCIELMT is a self-context identifier. MARGINPOINT identifies the info type. PCI identifies the format as PCI enumeration, rather than a hardware port.

The "location" parser pattern for "PCIELMT-MARGINPOINT-PCI" is

import re

# Parser for the "location" string of a PCIELMT-MARGINPOINT-PCI subcomponent.
pattern = re.compile(r"""
    BUS:(?P<bus>\w+)--                       # Link (bus number)
    RX:(?P<rx>[1-6])--                       # Rx (1-6)
    LN:(?P<lane>\d+)--                       # Lane
    (?P<torv>[TV]):(?P<offset>[+-]?[\d.]+)   # Timing or Voltage offset
    (?P<unit>UI|%UI|mV|V|step)               # Offset unit
""", re.X)
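
As a usage sketch, the pattern can be wrapped in a small helper that turns a location string into typed fields. `parse_location` and the named `unit` group are additions made for this illustration, and the input strings are the ones used in the examples in this post.

```python
import re

LOCATION_RE = re.compile(r"""
    BUS:(?P<bus>\w+)--                       # Link (bus number)
    RX:(?P<rx>[1-6])--                       # Rx (1-6)
    LN:(?P<lane>\d+)--                       # Lane
    (?P<torv>[TV]):(?P<offset>[+-]?[\d.]+)   # Timing or Voltage offset
    (?P<unit>UI|%UI|mV|V|step)               # Offset unit (named here for convenience)
""", re.X)


def parse_location(location):
    """Parse a PCIELMT-MARGINPOINT-PCI location string into a dict of typed fields."""
    match = LOCATION_RE.fullmatch(location)
    if match is None:
        raise ValueError(f"unrecognized location: {location!r}")
    fields = match.groupdict()
    return {
        "bus": fields["bus"],
        "rx": int(fields["rx"]),
        "lane": int(fields["lane"]),
        "kind": "timing" if fields["torv"] == "T" else "voltage",
        "offset": float(fields["offset"]),
        "unit": fields["unit"],
    }


parsed = parse_location("BUS:a1h--RX:6--LN:7--T:-0.15UI")
print(parsed)
```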

There are three measurements per PCIELMT-MARGINPOINT-PCI subcomponent:

    "measurement": {
      "name": "PCIELMT-MARGINPOINT-STATUS",
      "subcomponent": {
        "name": "PCIELMT-MARGINPOINT-PCI",
        "location": "BUS:a1h--RX:6--LN:7--T:-0.15UI",
      },
      "value": 2
    },

The value is specified by the PCIe spec: 3: NAK; 2: Margining in progress; 1: Set up for margin in progress; 0: Too many errors.
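
A minimal sketch of how a consumer might decode that value, using the four codes listed above and the payload[7:6] bit position mentioned earlier in this thread. The `MarginStatus` and `decode_status` names are hypothetical.

```python
from enum import IntEnum


class MarginStatus(IntEnum):
    """Status codes as listed above for PCIELMT-MARGINPOINT-STATUS."""
    TOO_MANY_ERRORS = 0
    SETUP_IN_PROGRESS = 1
    MARGINING_IN_PROGRESS = 2
    NAK = 3


def decode_status(payload):
    """Extract bits [7:6] of the 8-bit response payload as a MarginStatus."""
    return MarginStatus((payload >> 6) & 0x3)


status = decode_status(0b10000000)
print(status.name, int(status))
```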

    "measurement": {
      "name": "PCIELMT-MARGINPOINT-ERROR_COUNT",
      "subcomponent": {
        "name": "PCIELMT-MARGINPOINT-PCI",
        "location": "BUS:a1h--RX:6--LN:7--T:-5step",
      },
      "value": 63
    },

    "measurement": {
      "name": "PCIELMT-MARGINPOINT-SAMPLE_COUNT",
      "subcomponent": {
        "name": "PCIELMT-MARGINPOINT-PCI",
        "location": "BUS:a1h--RX:6--LN:7--V:0.05V",
      },
      "value":  1e12
    },
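
To show how the three readings could share one subcomponent, here is a hedged sketch that builds all three measurement objects for a single margin point. `margin_point_measurements` is a hypothetical helper, and the values are placeholders rather than real readings.

```python
def margin_point_measurements(location, status, error_count, sample_count):
    """Build the three raw measurements for one PCIELMT-MARGINPOINT-PCI subcomponent."""
    subcomponent = {"name": "PCIELMT-MARGINPOINT-PCI", "location": location}
    readings = {
        "STATUS": status,
        "ERROR_COUNT": error_count,
        "SAMPLE_COUNT": sample_count,
    }
    return [
        {
            "measurement": {
                "name": f"PCIELMT-MARGINPOINT-{suffix}",
                # Each measurement gets its own copy of the same subcomponent.
                "subcomponent": dict(subcomponent),
                "value": value,
            }
        }
        for suffix, value in readings.items()
    ]


measurements = margin_point_measurements(
    "BUS:a1h--RX:6--LN:7--T:-0.15UI", status=2, error_count=0, sample_count=1e12
)
for item in measurements:
    print(item["measurement"]["name"], item["measurement"]["value"])
```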

There is another set of raw measurements: the Rx LMR parameters read from the lane. These include:
PCIELMT-RXLANE-NUM-TIMING-STEPS : uint32 : Number of Timing Steps 6-63
PCIELMT-RXLANE-MAX-TIMING-OFFSET : uint32 : Max Timing Offset 20-50
PCIELMT-RXLANE-SAMPLING-RATE-TIMING : uint32 : Sampling Rate Timing 0-63:(1-64)/64
PCIELMT-RXLANE-IND-LEFT-RIGHT-TIMING : bool : Independent Left/Right Timing
PCIELMT-RXLANE-VOLTAGE-SUPPORTED : bool : Whether voltage margining is supported.
PCIELMT-RXLANE-NUM-VOLTAGE-STEPS : uint32 : Number of Voltage Steps 32-127
PCIELMT-RXLANE-MAX-VOLTAGE-OFFSET : uint32 : Max Voltage Offset 5-50
PCIELMT-RXLANE-SAMPLING-RATE-VOLTAGE : uint32 : Sampling Rate Voltage 0-63
PCIELMT-RXLANE-IND-UP-DOWN-VOLTAGE : bool : Independent up/down voltage margining.
PCIELMT-RXLANE-IND-ERROR-SAMPLER : bool : Independent Error Sampler
PCIELMT-RXLANE-MAX-LANES : uint32 : Max Lanes minus 1: 0-31
PCIELMT-RXLANE-SAMPLE-REPORTING-METHOD : bool : rates:1 or count:0

Those measurements have the RXLANE subcomponent:

      "subcomponent": {
        "name": "PCIELMT-RXLANE-PCI",
        "location": "BUS:a1h--RX:6--LN:7",
      },

Also at this RXLANE level, the diag can output processed measurements, such as
PCIELMT-RXLANE-TMARGIN_POSITIVE
PCIELMT-RXLANE-TMARGIN_NEGATIVE
PCIELMT-RXLANE-VMARGIN_POSITIVE
PCIELMT-RXLANE-VMARGIN_NEGATIVE
PCIELMT-RXLANE-EYE_HEIGHT
PCIELMT-RXLANE-EYE_WIDTH

I can't think of anything that needs to be specified at the TestStep level.

One consideration is that the amount of measurement output can be a lot if we require the raw measurements.

5 replies

sksekar Feb 28, 2024
Maintainer Author

Thanks @makestuffwork for summarizing our discussion and thanks @mimir-d for your inputs during the discussion. Here is my proposal for the https://github.com/opencomputeproject/ocp-diag-pci_lmt tool.

Mapping System Hierarchy

System Hierarchy

System (DUT comprising a set of PCIe devices)
- Rx Device (receiver at which the margining needs to be performed)
  - Rx Margin Point (margin type & step at which the margin should be measured)
    - Rx Lane (lane within the link)

My proposal (slightly different from what we discussed) is to map the System level info in dutInfo and Device level info in hardwareInfo within the testRunArtifact as below:

{
  "testRunArtifact": {
    "testRunStart": {
      "name": "pcie_lmt",
      "version": "2.1",
      "commandLine": "pcie_lmt --config dut.cfg --dwell_time 10s",
      "dutInfo": {
        "name": "lmt_dut",
        "softwareInfos": [
          {
            "softwareType": "OS",
            "name": "CentOS",
            "version": "9",
            "revision": "0"
          }
        ],
        "hardwareInfos": [
          {
            "name": "cpu-root-port",
            "location": "0000:00:01.1",
            "manufacturer": "Intel",
            "partType": "CPU"
          },
          {
            "name": "pcie-switch-port-upstream",
            "location": "0000:01:00.0",
            "manufacturer": "Broadcom",
            "partType": "PCIe Switch"
          },
          {
            "name": "pcie-switch-port-downstream",
            "location": "0000:22:00.0",
            "manufacturer": "Broadcom",
            "partType": "PCIe Switch"
          },
          {
            "name": "accl-end-point",
            "location": "0000:23:00.0",
            "manufacturer": "Meta",
            "partType": "Accelerator"
          }
        ]
      }
    }
  }
}

Open Questions:

How can we add more information to the HardwareInfos? Like Lane count, Lane speed/width etc for each Receiver?

Mapping Margining Capabilities

The following capabilities are read from the device's config space.

IndErrorSampler
SampleReportingMethod
IndLeftRightTiming
IndUpDownVoltage
VoltageSupported
NumVoltageSteps
NumTimingSteps
MaxTimingOffset
MaxVoltageOffset
SamplingRateVoltage
SamplingRateTiming
MaxLanes

Note: These are identical across all the lanes (within the link) and hence it's sufficient to report per Rx Device level (instead of per Rx Lane level)

Current proposal is to map the above information as individual Measurement in testStepArtifact for each Rx Device with optional validators (limits are available in the spec):

Sample:

{
  "testStepArtifact": {
    "measurement": {
      "name": "0000:00:01.1-num_voltage_steps",
      "unit": "step-count",
      "hardwareInfoId": "1",
      "validators": [
        {
          "name": "num_voltage_steps_upper_limit",
          "type": "LESS_THAN_OR_EQUAL",
          "value": 127.0
        },
        {
          "name": "num_voltage_steps_lower_limit",
          "type": "GREATER_THAN_OR_EQUAL",
          "value": 32.0
        }
      ],
      "value": 40.0
    }
  }
}
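
A sketch of how such capability measurements and their range validators might be generated and checked. The limits are the ranges quoted earlier in this thread and should be treated as assumptions to verify against the PCIe spec; `capability_measurement` and `passes` are hypothetical helpers.

```python
# (min, max) ranges as quoted earlier in this thread; verify against the PCIe spec.
LIMITS = {
    "num_timing_steps": (6, 63),
    "max_timing_offset": (20, 50),
    "num_voltage_steps": (32, 127),
    "max_voltage_offset": (5, 50),
}


def capability_measurement(bdf, name, value, hardware_info_id, unit="step-count"):
    """Build a measurement artifact with lower and upper limit validators."""
    low, high = LIMITS[name]
    return {"testStepArtifact": {"measurement": {
        "name": f"{bdf}-{name}",
        "unit": unit,
        "hardwareInfoId": hardware_info_id,
        "validators": [
            {"name": f"{name}_upper_limit", "type": "LESS_THAN_OR_EQUAL", "value": float(high)},
            {"name": f"{name}_lower_limit", "type": "GREATER_THAN_OR_EQUAL", "value": float(low)},
        ],
        "value": float(value),
    }}}


def passes(artifact):
    """Return True when the measurement value satisfies every validator."""
    measurement = artifact["testStepArtifact"]["measurement"]
    checks = {
        "LESS_THAN_OR_EQUAL": lambda value, limit: value <= limit,
        "GREATER_THAN_OR_EQUAL": lambda value, limit: value >= limit,
    }
    return all(
        checks[v["type"]](measurement["value"], v["value"])
        for v in measurement["validators"]
    )


sample = capability_measurement("0000:00:01.1", "num_voltage_steps", 40, "1")
print(sample["testStepArtifact"]["measurement"]["name"], passes(sample))
```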

Margining Results

The following information is reported by the device to indicate the status of margining at a given Margin Point:

SampleCount (or SampleCountBits)
ErrorCount
MarginStatus
- MarginingInProgress
- SetupInProgress
- TooManyErrors
- UnsupportedMarginCommand

Note: These values need to be reported for each lane within the link, and hence they will make up the majority of the output.

Current proposal is to report the ErrorCount using the MeasurementSeries in testStepArtifact. The SampleCount and MarginStatus are surfaced using Metadata

Sample:

# One measurementSeriesStart/End for each Rx Device and Rx Margin Point
{
  "testStepArtifact": {
    "measurementSeriesStart": {
      "measurementSeriesId": "0",
      "name": "0000:00:01.1-timing_margin_right-step10-error_count",
      "unit": "%UI",
      "hardwareInfoId": "5",
      "validators": [
        {
          "name": "error_count_limit",
          "type": "EQUAL",
          "value": 0
        }
      ]
    }
  }
}

# One measurementSeriesElement for each Rx Lane which is margined to surface error_count.
{
  "testStepArtifact": {
    "measurementSeriesElement": {
      "index": 0,
      "measurementSeriesId": "0",
      "value": 0.0,
      "metadata": {
        "sampleCount": "1000",
        "marginStatus": "IN_PROGRESS"
      }
    }
  }
}
:
{
  "testStepArtifact": {
    "measurementSeriesElement": {
      "index": 1,
      "measurementSeriesId": "0",
      "value": 0.0,
      "metadata": {
        "sampleCount": "1000",
        "marginStatus": "IN_PROGRESS"
      }
    }
  }
}

{
  "testStepArtifact": {
    "measurementSeriesEnd": {
      "measurementSeriesId": "0",
      "totalCount": 8
    }
  }
}
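
The series above could be assembled from per-lane results along these lines. This is a sketch only: `series_artifacts` is a hypothetical helper, the field values mirror the samples in this post, and the metadata values are kept as strings as shown there.

```python
def series_artifacts(series_id, name, hardware_info_id, lanes):
    """Build start, per-lane element, and end artifacts for one Rx Margin Point.

    lanes is a list of (error_count, sample_count, margin_status), one per Rx Lane.
    """
    def wrap(payload):
        return {"testStepArtifact": payload}

    artifacts = [wrap({"measurementSeriesStart": {
        "measurementSeriesId": series_id,
        "name": name,
        "unit": "%UI",
        "hardwareInfoId": hardware_info_id,
        "validators": [{"name": "error_count_limit", "type": "EQUAL", "value": 0}],
    }})]
    for index, (error_count, sample_count, margin_status) in enumerate(lanes):
        artifacts.append(wrap({"measurementSeriesElement": {
            "index": index,
            "measurementSeriesId": series_id,
            "value": float(error_count),
            "metadata": {
                "sampleCount": str(sample_count),
                "marginStatus": margin_status,
            },
        }}))
    artifacts.append(wrap({"measurementSeriesEnd": {
        "measurementSeriesId": series_id,
        "totalCount": len(lanes),
    }}))
    return artifacts


series = series_artifacts(
    "0",
    "0000:00:01.1-timing_margin_right-step10-error_count",
    "5",
    [(0, 1000, "IN_PROGRESS")] * 8,
)
print(len(series), "artifacts")
```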

Questions:

What's the difference between the status reported by TestRunEnd vs Diagnosis/Error artifacts?

mimir-d Feb 28, 2024
Maintainer

in reverse order, I can address some of the questions

@sksekar

What's the difference between the status reported by TestRunEnd vs Diagnosis/Error artifacts?

in TestRunEnd

TestStatus enumeration which specifies the execution status of the test run.
TestResult enumeration which indicates the overall result of the test run.

in Diagnosis

A diagnosis gives the verdict of the health status for hardware components under test.

in Error

An error is an output artifact that reports a software, firmware, test or any other hardware-unrelated issues

So, the answer is: Diagnosis explains a conclusion of the test exercise, which is useful on its own. An Error is closer to logging, but specifically different in that it can show a singular error that happened during the test (or before, as in failing to determine DUT details). A TestRunEnd status/result fields specify whether the test run was completed fully and what the overall outcome was. One can make a correlation between them, in that an Error regarding hardware could result in a negative Diagnosis which then results in a negative TestRunEnd result. However, the spec is not prescriptive on these semantics and the diag may choose to determine itself whether this causal chain is valid.

@sksekar

How can we add more information to the HardwareInfos? Like Lane count, Lane speed/width etc for each Receiver?

Those sound to me like subcomponents (and it matches with the spec semantics as the "lane", "receiver", etc are not physical). However, you can choose to increase the granularity of the HardwareInfo and add additional hwinfo objects that specify those lanes, by adding to the location (eg. "/lane0"); however this may conflict somewhat with the semantic of the object (needs to have a manufacturer, etc).

@makestuffwork

One consideration is that the amount of measurement output can be a lot if we require the raw measurements.

That is expected. The diag output as per the spec semantics should be as complete as possible and let the consumer filter whatever they need. Having said that, there is nothing to say that you couldn't filter the output from the diag by providing some input parameter.

both @sksekar @makestuffwork

your mappings (link/rxlane/margin and rxdevice/margin/rxlane) construe different scenarios. In my opinion both should be selectable by a user of the diag by virtue of an input parameter. Given that they conflict in execution (one aims to establish the capacity/width of the signal eye and the other validates expectations of error counts at given margins), I see the diag as a mux of the 2 goals and hence why the user should select one per execution (although i suppose both could also be done sequentially as different steps?).

makestuffwork Mar 2, 2024

Note: These are identical across all the lanes (within the link) and hence it's sufficient to report per Rx Device level (instead of per Rx Lane level)

That is a valid assumption. However, validating the assumptions is the purpose of testing. We can certainly apply one set of parameter readings across all lanes as a test condition. However, when we treat the parameter readings as measurements, they should be attributed to the RxLane from which they are read. We should always leave room for defects.

Current proposal is to report the ErrorCount using the MeasurementSeries in testStepArtifact. The SampleCount and MarginStatus are surfaced using Metadata

I remember that we gave up the MeasurementSeries approach during our first meeting, because of its specific use case which is not applicable here. For example, the MarginStatus is a reading from the hardware. We want the spec to make sure that the consumer can get this data from the diag. Therefore, it deserves to be a Measurement on its own. The Metadata is too loosely specified for that purpose. Let's talk more about this.

your mappings (link/rxlane/margin and rxdevice/margin/rxlane) construe different scenarios.

I think the common ground here is that they point out where the measurement is made, and we can use the Subcomponent to capture that info. The difference is only in the format. If you look at my "PCIELMT-MARGINPOINT-PCI" Subcomponent format example, we can use various format identifiers and parser strings to accommodate both schemes.

mimir-d Mar 13, 2024
Maintainer

I remember that we gave up the MeasurementSeries approach

@sksekar please confirm if this measurement series is time based. Are those measurement element sample points taken at different points in wallclock time? If so, that would be a valid use of series

on a related note, looking at the json, i think we can actually use the id to encode the testing scenario, so instead of

    "measurementSeriesStart": {
      "measurementSeriesId": "0",
      "name": "0000:00:01.1-timing_margin_right-step10-error_count",

it could be

    "measurementSeriesStart": {
      "measurementSeriesId": "0000:00:01.1-timing_margin_right-step10",
      "name": "error_count",

makestuffwork Mar 20, 2024

I want to give the diag the flexibility of using either measurement series, or a single measurement just for the end BER. However, Measurement does not have a measurementSeriesId.

Also, there are three readings per MarginPoint:

ErrorCount
MarginStatus
SampleCount (or SampleCountBits)

So, we need three measurement series or measurements, and we need to use the name to identify each.

For the purpose of maximizing flexibility, I want to use the Name field to identify how to interpret the info in the object. Then the Name field should not need parsing.

I don't think there's a better place to encode the timing/voltage offset than under the Subcomponent. We can attach the same Subcomponent to multiple measurement series and/or measurements

sksekar · 2024-02-28T16:18:38Z

sksekar
Feb 28, 2024
Maintainer Author

Thanks @mimir-d. Some comments/follow-up

the spec is not prescriptive on these semantics and the diag may choose to determine itself whether this causal chain is valid.

SGTM. My preference would be to use TestRunEnd to declare overall pass/fail status.

Those (Like Lane count, Lane speed/width etc for each Receiver) sound to me like subcomponents

Looking closer, I think we can surface this information as part of the Device/Margining Capabilities measurement, which is already surfaced per Device.

I see the diag as a mux of the 2 goals and hence why the user should select one per execution

Yes, that's correct. In our case, user is allowed to select the goal (eye-scan vs spot-check) using the config file provided as input.

although i suppose both could also be done sequentially as different steps?

Yes, there is no restriction in doing both (eye-scan and spot-check) as different steps.

0 replies

makestuffwork · 2024-04-24T05:57:29Z

makestuffwork
Apr 24, 2024

As discussed, attached is an output example for our discussion:
lmt_ocp_output_example.json

3 replies

sksekar Apr 24, 2024
Maintainer Author

Thanks for sharing the output. This helps answer multiple questions that I had in mind. But there are a few more:

What constitutes a block within a testStepStart and testStepEnd?
Looks like (based on the timestamps) your test is running multiple lanes simultaneously and each lane might be running either VOLTAGE or TIMING independently. Is that right?
You don't seem to have Validators to determine if the device passed/failed margining. Is that expected?
Do you happen to have testRunArtifact for your test run? I'm curious about Hardware Info and Component Info :)

My solution is almost the same, with a few minor differences:

My test would emit a measurement only for the final status (either the test completed or errored out) of each lane. So, you will see only PCIELMT-MARGINPOINT-SAMPLE_COUNT and PCIELMT-MARGINPOINT-ERROR_COUNT.
The "location" info would be slightly different, but we can converge. Do we need to have a standardized format here?
I'm planning to have validators to declare PASS/FAIL, which is required for the MFG use case.

makestuffwork Apr 24, 2024

Thanks for the quick review!

First, that output example is by no means my ideal. It's mainly to demonstrate some constraints we are facing, and the flexibility we desire.

The diag execution has at least two dimensions: One is time, and the other is the system components which can be hierarchical, parallel, or a mixture of both. One constraint is that the TestStep forces the output into a single chronological thread. Therefore, we have to use the Subcomponent to traverse the system within that single thread. This was brought up a couple weeks ago, and I remember @mimir-d was open to support a threading nomenclature in the output spec. Yes, my LMT tests multiple lanes simultaneously and independently for the sake of test time saving.

The PCI-SIG's LMR spec has its own "validator", and the checking result is reflected in the PCIELMT-MARGINPOINT-STATUS measurement. My intention is to keep my LMT within the scope of PCI-SIG's standard. I use it as a generic bench testing tool. To use it as a DUT-specific production diag, there is another piece. It translates the PCIe BDF to HW location for repair instructions. It can also add the OCP Validator to qualify the PCIELMT-MARGINPOINT-STATUS, or any other measurement. Those are very production-DUT-specific.

I want to standardize the flexibility. Therefore, I use the "subcomponent.name" field to indicate the format of the "location". For example, the name PCIELMT-MARGINPOINT-PCI has three parts. PCIELMT is a self-context identifier. MARGINPOINT identifies the info type. PCI identifies the format as PCI enumeration, rather than a hardware port. If a new diag has info that does not fit into this format, it can define a new "name". This does seem very complicated. It'd be nice if the OCP format spec can help simplify it.

Another guideline-constraint is that the diag should report all raw measurements. However, the output consumer may not care about all the details. An interpreted PASS/FAIL may be sufficient. The raw measurement can be dumped to a separate non-OCP-format log file for offline analysis purpose. Then, it makes this LMT output much more consumer-centric and standardizable.

makestuffwork Apr 26, 2024

I asked around. There is nothing preventing us from using a TestStep to model each parallel-executing test thread. If so, I propose to use TestStep to model at the receiver-port level. For example: "BUS:a1h--RX:6". There is still parallelism at the lane level within a TestStep. However, the receiver-port usually points to a hardware component that's not subdivisible, and is subject to repair.

makestuffwork · 2024-05-14T21:24:44Z

makestuffwork
May 14, 2024

We saw the ocp-diag-core-viewer demo a few weeks ago. With that, I tuned the pcie_lmt OCP output the way I'd like to see as a user. This pcie_lmt_ocp.json is a sample output. I can also demo it from the ocp-diag-core-viewer in the meeting.

The pcie_lmt can stream the OCP artifacts to a file or a named pipe. Our use case has a diag-runner which creates this OCP pipe and listens to it. It converts the PCIe-domain BDF info to the DUT-specific HW-Info. This way, the pcie_lmt can stay generic and only PCI-SIG-aware. The diag-runner is also generic in the sense that it can run various diags as a sub-process.
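
To make the runner idea concrete, here is a rough sketch of a consumer that reads JSON lines from a stream and attaches DUT-specific hardware info keyed by BDF. It is not the actual diag-runner: the `annotate` helper, the mapping table, and the choice of `measurementSeriesId` as the field carrying the BDF (borrowed from the first sample in this thread) are all assumptions. An in-memory stream stands in for the pipe; a file object opened on a named pipe could be passed in the same way.

```python
import io
import json

# Hypothetical mapping from PCIe BDF to a DUT-specific hardware info id.
BDF_TO_HW = {"0000:1a:00.0": "hypothetical-slot-3"}


def annotate(stream, bdf_to_hw):
    """Yield parsed artifacts, adding hardwareInfoId to series whose id is a known BDF."""
    for line in stream:
        line = line.strip()
        if not line:
            continue
        artifact = json.loads(line)
        start = artifact.get("testStepArtifact", {}).get("measurementSeriesStart")
        if start is not None:
            bdf = start.get("measurementSeriesId")
            if bdf in bdf_to_hw:
                start["hardwareInfoId"] = bdf_to_hw[bdf]
        yield artifact


# In-memory stand-in for the pipe the diag would write to.
fake_pipe = io.StringIO(
    '{"testStepArtifact": {"testStepId": "0", "measurementSeriesStart": '
    '{"name": "BDF_0000:1a:00.0", "measurementSeriesId": "0000:1a:00.0"}}}\n'
    '{"testStepArtifact": {"testStepId": "0", "testStepEnd": {"status": "COMPLETE"}}}\n'
)
annotated = list(annotate(fake_pipe, BDF_TO_HW))
print(json.dumps(annotated[0]))
```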

The pcie_lmt runs parallel TestSteps, each maps to an RX-port. Within each TestStep, the lanes are also running in parallel. I'm counting on the sorting and filtering features of the result viewers.

Instead of dumping all the raw measurements, I now only output what "matters". The raw measurements are still dumped in a log for reference. As a user, I'd like to see the interpreted results upfront. So the pcie_lmt outputs eye size, eye corner margin, BER, and/or status. Irrelevant and/or implied information, such as 0-error margin points in an eye-scan, are omitted.

Still, there is more info to fit in a Measurement artifact. I'm overloading the unit and the hw_info_id fields. I'm also putting the lane number in the name field for now. Maybe the subcomponent is a better place.

0 replies
