[Question or Exploratory Analysis] Frequency Domain Insights for Thinking Trajectories Enhancement #69
Dear S1 Team,
First, I must apologize if this comes across as abrupt - I've been away from academia for quite some time and haven't worked with Python in years. However, your open-sourced work has inspired me to dive back in, and I'd be grateful for your guidance on an exploratory analysis I've attempted.
Core Hypothesis
The remarkable performance of S1 might stem from frequency-domain distinctions between the Question and thinking_trajectories data. Specifically, after embedding, thinking_trajectories shows a stronger signal and a denser distribution on certain specific dimensions than Question does.
Methodology
Using an optical analogy (light prism → rainbow spectrum), I developed an analysis framework:
1. Convert sentences to (L, D) embeddings. Take a sentence from either Question or thinking_trajectories and treat it as a sequence of length L. Embedding each token in the sequence gives a (1, D) vector, so the whole sentence becomes an (L, D) matrix.
2. Apply Fourier/Wavelet transforms per embedding dimension. Sentences in the S1 set have arbitrary lengths, so their (L, D) matrices are not directly comparable. To normalize them, apply a Fourier/Wavelet transform along the length axis, independently for each embedding dimension. This turns (L, D) into (F, D), where F is the frequency description of the length-L signal on each dimension; for example, (F, 1) is the frequency description of the 1st dimension. If we keep only the top 5 frequency components by amplitude, every sentence is reduced from (L, D) to a fixed-size (5, D) signature (a minimal sketch of this reduction appears right after this list).
3. Extract (F, D) frequency signatures. Compute the signature for all 1k examples, for both the Question and thinking_trajectories fields.
4. Visualize the distributions. This yields 1k data points (embedding dimension × frequency × amplitude) for Question and 1k for thinking_trajectories. Draw each point with very low opacity; after overlaying the full 1k set, the core of each distribution becomes visible (a self-contained sketch of this overlay appears after the findings below).
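To make step 2 concrete, here is a minimal sketch of the (L, D) → (5, D) reduction using NumPy's real FFT. The embedding step is assumed to happen elsewhere (e.g., per-token hidden states from Qwen-0.5); the function name frequency_signature and the choice to normalize bin indices by the spectrum length are illustrative assumptions, not necessarily what my attached script does.

```python
import numpy as np

def frequency_signature(emb: np.ndarray, top_k: int = 5):
    """Reduce an (L, D) embedding matrix to a fixed-size signature.

    Returns two (top_k, D) arrays: the normalized frequency bins of the
    top_k strongest components per embedding dimension, and their
    amplitudes. Assumes L is large enough that L // 2 + 1 >= top_k.
    """
    # Real FFT along the sequence (length) axis, per dimension.
    spectrum = np.fft.rfft(emb, axis=0)        # (L // 2 + 1, D), complex
    amplitude = np.abs(spectrum)               # (F, D) real amplitudes
    # Per dimension, indices of the top_k strongest frequency bins.
    top_bins = np.argsort(amplitude, axis=0)[-top_k:, :]
    top_amp = np.take_along_axis(amplitude, top_bins, axis=0)
    # Normalize bin indices by the spectrum length so sentences of
    # different lengths land on a comparable [0, 1) frequency axis.
    top_freq = top_bins / amplitude.shape[0]
    return top_freq, top_amp
```

One caveat: the DC bin (the per-dimension mean) will usually dominate the amplitudes, so it might be worth excluding spectrum[0] before ranking; I have not settled on that.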
Preliminary Findings (Qwen-0.5, about 10 samples)
In the dimension × frequency view, thinking_trajectories shows a strong red line. I am not sure whether this is just background noise, since thinking_trajectories contains much more text than the question itself, but the rest of its distribution also differs: the Question field appears as separated, line-like patterns.
In the dimension × amplitude view, thinking_trajectories (blue) looks more concentrated than Question (purple). Judging from the code, the concentration across dimensions seems to happen faster, or earlier, perhaps because the trajectory stays focused on the task?
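To show the overlay idea concretely, below is a self-contained sketch of the low-opacity scatter from step 4. The random stand-in data only demonstrates the plotting; in the real pipeline, question_sigs and trajectory_sigs would be built by calling frequency_signature on each embedded sentence.

```python
import numpy as np
import matplotlib.pyplot as plt

def overlay(ax, signatures, color, label):
    """Scatter (dimension, frequency) points with very low opacity so
    that overlaying ~1k signatures reveals the dense core."""
    for top_freq, _top_amp in signatures:
        k, d = top_freq.shape
        dims = np.broadcast_to(np.arange(d), (k, d))
        ax.scatter(dims.ravel(), top_freq.ravel(),
                   s=2, alpha=0.02, color=color, label=label)
        label = None  # label only the first batch for the legend

# Hypothetical stand-in data; real signatures come from frequency_signature().
rng = np.random.default_rng(0)
question_sigs = [(rng.random((5, 64)), rng.random((5, 64))) for _ in range(100)]
trajectory_sigs = [(rng.random((5, 64)), rng.random((5, 64))) for _ in range(100)]

fig, ax = plt.subplots(figsize=(10, 4))
overlay(ax, question_sigs, "purple", "Question")
overlay(ax, trajectory_sigs, "blue", "thinking_trajectories")
ax.set_xlabel("embedding dimension")
ax.set_ylabel("normalized frequency")
ax.legend(markerscale=4)
plt.show()
```

The amplitude view from the second finding is the same plot with top_amp on the y-axis instead of top_freq.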
Why This Matters
This could lead to:
Collaboration Request
While the initial analysis shows promise, I must admit my limitations:
Would your team consider:
I've attached my humble attempt at implementing this pipeline. While it may not meet professional standards, I hope it conveys the core idea. I would be deeply grateful for your mentorship in refining this exploration.
Technical Notes
Thank you for considering this request from an enthusiastic but rusty learner. Your work has truly inspired me to re-engage with this field.
Best regards.