
Non-reproducibility and crashes related to pixel tracking (CAHitNtupletAlpakaHIonPhase1) in the HIon integration tests #49186

@mmusich

While integrating the main ticket for the 2025 PbPb menu (CMSHLT-3658), I ran into non-reproducible results and crashes in the pixel tracking package, for example:

Thread 18 (Thread 0x7f5e5f083640 (LWP 2474523) "cmsRun"):
#0  0x00007f5eae30200f in poll () from /lib64/libc.so.6
#1  0x00007f5ea9d92297 in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_15_1_0/lib/el9_amd64_gcc12/pluginFWCoreServicesPlugins.so
#2  0x00007f5ea9d92494 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_15_1_0/lib/el9_amd64_gcc12/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007f5e47f1b10d in alpaka::TaskKernelCpuSerial<std::integral_constant<unsigned long, 1ul>, unsigned int, alpaka_serial_sync::caHitNtupletGeneratorKernels::Kernel_fillGenericPair, caStructures::CAP\
airLayout<128ul, false>::ViewTemplateFreeParams<128ul, false, true, true>&, unsigned int*, cms::alpakatools::OneToManyAssocRandomAccess<unsigned int, -1, -1>*>::operator()() const () from /cvmfs/cms.cern\
.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_15_1_0/lib/el9_amd64_gcc12/pluginRecoTrackerPixelSeedingPortableSerialSync.so
#5  0x00007f5e47fba716 in void alpaka::exec<alpaka::AccCpuSerial<std::integral_constant<unsigned long, 1ul>, unsigned int>, alpaka::QueueGenericThreadsBlocking<alpaka::DevCpu>, alpaka::WorkDivMembers<std\
::integral_constant<unsigned long, 1ul>, unsigned int>, alpaka_serial_sync::caHitNtupletGeneratorKernels::Kernel_fillGenericPair, caStructures::CAPairLayout<128ul, false>::ViewTemplateFreeParams<128ul, f\
alse, true, true>&, unsigned int*, cms::alpakatools::OneToManyAssocRandomAccess<unsigned int, -1, -1>*>(alpaka::QueueGenericThreadsBlocking<alpaka::DevCpu>&, alpaka::WorkDivMembers<std::integral_constant\
<unsigned long, 1ul>, unsigned int> const&, alpaka_serial_sync::caHitNtupletGeneratorKernels::Kernel_fillGenericPair const&, caStructures::CAPairLayout<128ul, false>::ViewTemplateFreeParams<128ul, false,\
 true, true>&, unsigned int*&&, cms::alpakatools::OneToManyAssocRandomAccess<unsigned int, -1, -1>*&&) [clone .constprop.0] [clone .isra.0] () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_15_1\
_0/lib/el9_amd64_gcc12/pluginRecoTrackerPixelSeedingPortableSerialSync.so
#6  0x00007f5e47f1d87f in alpaka_serial_sync::CAHitNtupletGeneratorKernels<pixelTopology::HIonPhase1>::launchKernels(reco::TrackingHitsLayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true\
, true> const&, unsigned int, unsigned short, reco::TrackLayout<128ul, false>::ViewTemplateFreeParams<128ul, false, true, true>&, reco::TrackHitsLayout<128ul, false>::ViewTemplateFreeParams<128ul, false,\
 true, true>&, reco::CALayersLayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> const&, reco::CAGraphLayout<128ul, false>::ConstViewTemplateFreeParams<128ul, false, true, true> c\
onst&, alpaka::QueueGenericThreadsBlocking<alpaka::DevCpu>&) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_15_1_0/lib/el9_amd64_gcc12/pluginRecoTrackerPixelSeedingPortableSerialSync.so
#7  0x00007f5e47f12a5e in alpaka_serial_sync::CAHitNtupletGenerator<pixelTopology::HIonPhase1>::makeTuplesAsync(reco::TrackingRecHitHost const&, PortableHostMultiCollection<reco::CALayersLayout<128ul, fa\
lse>, reco::CAGraphLayout<128ul, false>, reco::CAModulesLayout<128ul, false> > const&, float, unsigned int, unsigned int, alpaka::QueueGenericThreadsBlocking<alpaka::DevCpu>&) const () from /cvmfs/cms.ce\
rn.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_15_1_0/lib/el9_amd64_gcc12/pluginRecoTrackerPixelSeedingPortableSerialSync.so
#8  0x00007f5e47f12dc8 in alpaka_serial_sync::CAHitNtupletAlpaka<pixelTopology::HIonPhase1>::produce(alpaka_serial_sync::device::Event&, alpaka_serial_sync::device::EventSetup const&) () from /cvmfs/cms.\
cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_15_1_0/lib/el9_amd64_gcc12/pluginRecoTrackerPixelSeedingPortableSerialSync.so
#9  0x00007f5e47f0aed9 in alpaka_serial_sync::stream::EDProducer<edm::GlobalCache<reco::CAGeometryParams>, edm::RunCache<cms::alpakatools::MoveToDeviceCache<alpaka::DevCpu, PortableHostMultiCollection<re\
co::CALayersLayout<128ul, false>, reco::CAGraphLayout<128ul, false>, reco::CAModulesLayout<128ul, false> > > > >::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/c\
ms/cmssw/CMSSW_15_1_0/lib/el9_amd64_gcc12/pluginRecoTrackerPixelSeedingPortableSerialSync.so
#10 0x00007f5eaf655d75 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12\
/cms/cmssw/CMSSW_15_1_0/lib/el9_amd64_gcc12/libFWCoreFramework.so
#11 0x00007f5eaf63a59c in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/\
CMSSW_15_1_0/lib/el9_amd64_gcc12/libFWCoreFramework.so
#12 0x00007f5eaf5c0d39 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::excepti\
on_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchA\
ctionType)1>::Context const*) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_15_1_0/lib/el9_amd64_gcc12/libFWCoreFramework.so
#13 0x00007f5eaf5c1234 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_15_1_0/li\
b/el9_amd64_gcc12/libFWCoreFramework.so
#14 0x00007f5eaf802388 in tbb::detail::d2::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CM\
SSW_15_1_0/lib/el9_amd64_gcc12/libFWCoreConcurrency.so
#15 0x00007f5eaf75b5da in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (t=<optimized out>, waiter=..., this=0x7f5ead5d1f00) at /data/cmsbld/jenkin\
s/workspace/build-any-ib/w/BUILD/el9_amd64_gcc12/external/tbb/v2022.0.0-1feaa53e42a55cbacf84e40dc1fc78f6/tbb-v2022.0.0/src/tbb/task_dispatcher.h:334
#16 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter> (t=0x0, waiter=..., this=0x7f5ead5d1f00) at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el9_amd6\
4_gcc12/external/tbb/v2022.0.0-1feaa53e42a55cbacf84e40dc1fc78f6/tbb-v2022.0.0/src/tbb/task_dispatcher.h:470
#17 tbb::detail::r1::arena::process (tls=..., this=<optimized out>) at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el9_amd64_gcc12/external/tbb/v2022.0.0-1feaa53e42a55cbacf84e40dc1fc78f6/tbb-v202\
2.0.0/src/tbb/arena.cpp:215
#18 tbb::detail::r1::thread_dispatcher_client::process (td=..., this=<optimized out>) at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el9_amd64_gcc12/external/tbb/v2022.0.0-1feaa53e42a55cbacf84e40\
dc1fc78f6/tbb-v2022.0.0/src/tbb/thread_dispatcher_client.h:41
#19 tbb::detail::r1::thread_dispatcher::process (this=<optimized out>, j=...) at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el9_amd64_gcc12/external/tbb/v2022.0.0-1feaa53e42a55cbacf84e40dc1fc78f\
6/tbb-v2022.0.0/src/tbb/thread_dispatcher.cpp:195
#20 0x00007f5eaf753688 in tbb::detail::r1::rml::private_worker::run (this=0x7f5eaabc7080) at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el9_amd64_gcc12/external/tbb/v2022.0.0-1feaa53e42a55cbacf8\
4e40dc1fc78f6/tbb-v2022.0.0/src/tbb/private_server.cpp:271
#21 tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7f5eaabc7080) at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el9_amd64_gcc12/external/tbb/v2022.0.0-1feaa53e42a55cbacf84e40dc1fc78f\
6/tbb-v2022.0.0/src/tbb/private_server.cpp:221

[...]

Current Modules:

Module: alpaka_serial_sync::CAHitNtupletAlpakaHIonPhase1:hltPixelTracksPPOnAASoA (crashed)
Module: RawDataCollectorByLabel:rawDataCollector
Module: RawDataCollectorByLabel:rawDataCollector
Module: L1TDigiToRaw:packGtStage2

A fatal system signal has occurred: segmentation violation

A simple reproducer follows (to be run in CMSSW_15_1_0_patch1):

#!/bin/bash -ex

for i in {1..20}; do
  dirname="run_${i}"  # or replace with your actual directory naming pattern
  echo ">>> Running test in directory: ${dirname}"

  hltIntegrationTests /dev/CMSSW_15_1_0/HIon/V10 \
    -n 1000 \
    --input /store/hidata/HIRun2024B/HIEphemeralZeroBias0/RAW/v1/000/388/769/00000/1c181bd8-e9cf-4621-b68c-768ec5d49ff3.root \
    -x "--globaltag 150X_dataRun3_HLT_v1" \
    -x "--no-output" \
    -x "--eras Run3_2025 --l1-emulator uGT --l1 L1Menu_CollisionsHeavyIons2024_v1_0_6_xml" \
    -x "--open" \
    --paths "DQM_HIPixelReconstruction_v*" \
    --dir "${dirname}"

  echo ">>> Done with ${dirname}"
  echo "--------------------------------------"
done

It will eventually crash, given enough trials.
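Since the crash is intermittent, it can help to check afterwards which of the trial directories actually hit it. The following is only a sketch, not part of the original reproducer: the helper name `crashed_runs` is hypothetical, and it assumes that the logs are plain-text files somewhere under each `run_<i>` directory and that they contain the "A fatal system signal has occurred" message shown above.

```shell
#!/bin/bash
# Hypothetical helper (not part of hltIntegrationTests): print the name of every
# run_* directory under a base directory whose files contain the crash message.
# Assumption: the logs are plain-text files somewhere below each run_<i> directory.
crashed_runs() {
  local base="${1:-.}"
  local pattern="A fatal system signal has occurred"
  local dir
  for dir in "${base}"/run_*/; do
    # Skip the unexpanded glob when no run_* directory exists.
    [ -d "${dir}" ] || continue
    # -r: search recursively, -q: only report via exit status, -F: fixed string.
    if grep -rqF -- "${pattern}" "${dir}"; then
      basename "${dir}"
    fi
  done
}

# Scan the directory given as first argument (default: current directory).
crashed_runs "${1:-.}"
```

Run from the directory that holds the `run_1` ... `run_20` folders, this prints one directory name per crashed trial, which gives a rough crash rate for the chosen number of trials.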
