
[Question or Exploratory Analysis] Frequency Domain Insights for Thinking Trajectories Enhancement #69

@SamYuan1990 SamYuan1990 commented Feb 21, 2025

Dear S1 Team,

First, I must apologize if this comes across as abrupt - I've been away from academia for quite some time and haven't worked with Python in years. However, your open-sourced work has inspired me to dive back in, and I'd be grateful for your guidance on an exploratory analysis I've attempted.

It started with my curiosity: as an open-source LLM, Qwen has already seen a great deal of mathematical training data, and many groups have fine-tuned it, quite possibly on data overlapping with S1's. So why does the thinking_trajectories field make such a difference in S1?

Core Hypothesis

The remarkable performance of S1 might stem from frequency domain distinctions between Question and thinking_trajectories data. Specifically:

  • After embedding, thinking_trajectories shows a stronger signal and a denser distribution along certain embedding dimensions than Question does
  • These patterns could be transferable to other domains (e.g., physics) through systematic data preparation
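One way to make "denser distribution" measurable is spectral energy concentration per embedding dimension. This is a minimal sketch under my own assumptions (NumPy's real FFT standing in for the transform; `spectral_concentration` is a hypothetical name, not from the original pipeline):

```python
import numpy as np

def spectral_concentration(embeddings: np.ndarray, top_k: int = 5) -> np.ndarray:
    """Per-dimension fraction of spectral amplitude held by the top_k frequencies.

    embeddings: (L, D) matrix -- L tokens, D embedding dimensions.
    Values near 1 mean the series along the token axis is dominated by a few
    frequencies ("denser" in the frequency domain); a flat spectrum scores low.
    """
    amp = np.abs(np.fft.rfft(embeddings, axis=0))       # (L//2 + 1, D)
    top = np.sort(amp, axis=0)[::-1][:top_k]            # top_k amplitudes per dim
    return top.sum(axis=0) / amp.sum(axis=0)

rng = np.random.default_rng(0)
t = np.arange(64)
# a pure sinusoid along the token axis concentrates into one frequency bin
sine = np.sin(2 * np.pi * 4 * t / 64)[:, None]
# white noise spreads its energy across all bins
noise = rng.normal(size=(64, 1))
print(spectral_concentration(sine))   # close to 1.0
print(spectral_concentration(noise))  # much lower
```

If the hypothesis holds, thinking_trajectories embeddings should score consistently higher than Question embeddings on some dimensions.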

Methodology

Using an optical analogy (light prism → rainbow spectrum), I developed an analysis framework:

  1. Token-Level Spectral Decomposition
  • Convert sentences to (L, D) embeddings
    Take a sentence from either Question or thinking_trajectories and treat it as a series of length L. Embed each character/token in the series to obtain a (1, D) vector, giving an (L, D) matrix per sentence.

  • Apply a Fourier/Wavelet transform per embedding dimension
    Sentences in the S1 set have arbitrary lengths. To normalize, apply a Fourier/Wavelet transform along the token axis for each embedding dimension, converting (L, D) into (F, D), where F is the frequency description of the length-L series on each dimension; for example, (F, 1) is the frequency description for the 1st dimension. If we keep only the top 5 frequency components by amplitude, every sentence's (L, D) is normalized to a fixed-size (5, D) signature.

  • Extract (F, D) frequency signatures
    This yields an (F, D) signature for each of the 1k examples, for both the Question and thinking_trajectories fields.

  2. Comparative Visualization
  • 3D rendering of (Embedding Dimension × Frequency × Amplitude)
  • Cumulative visualization with alpha blending (simulating "long-exposure" observation)
    Given 1k Question points and 1k thinking_trajectories points in (Embedding Dimension × Frequency × Amplitude) space, draw each point with very low opacity. After overlaying all 1k samples, the core of each distribution becomes visible.
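The (L, D) → (F, D) reduction in step 1 can be sketched as follows. This is only an illustration under my assumptions: `spectral_signature` is a hypothetical name, NumPy's real FFT stands in for the Fourier/Wavelet transform, and the toy input replaces real Qwen embeddings:

```python
import numpy as np

def spectral_signature(embeddings: np.ndarray, top_k: int = 5) -> np.ndarray:
    """Reduce an (L, D) token-embedding matrix to a fixed-size (top_k, D) signature.

    For each embedding dimension d, take the FFT of the length-L series
    embeddings[:, d] and keep the top_k frequency components by amplitude,
    so sentences of any length map to the same (top_k, D) shape.
    """
    spectrum = np.fft.rfft(embeddings, axis=0)   # (L//2 + 1, D), one spectrum per dim
    amplitude = np.abs(spectrum)
    # per-dimension indices of the top_k strongest frequencies, descending
    idx = np.argsort(amplitude, axis=0)[::-1][:top_k]        # (top_k, D)
    return np.take_along_axis(amplitude, idx, axis=0)        # (top_k, D)

# toy example: a "sentence" of 32 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
emb = rng.normal(size=(32, 8))
sig = spectral_signature(emb)
print(sig.shape)  # (5, 8)
```

Note this keeps only the amplitudes; the frequency indices themselves could also be retained if the analysis needs to know *which* frequencies dominate, not just how strongly.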

Preliminary Findings (Qwen-0.5, ~10 samples)

(attached figure: 3D overlay of Embedding Dimension × Frequency × Amplitude for Question vs. thinking_trajectories)

Viewed in the (Dimension, Frequency) plane, thinking_trajectories shows a strong red line. I am not sure whether this is background noise, since thinking_trajectories contains much more text than the questions themselves, but the rest of the distribution also differs: the Question field appears as line-like patterns.

Viewed in the (Dimension, Amplitude) plane, thinking_trajectories (blue) appears more concentrated than Question (purple). Judging from the code, the embedding dimensions seem to concentrate faster or earlier, perhaps reflecting focus on the task?
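For reference, the "long-exposure" alpha-blended overlay described in the methodology can be sketched with matplotlib. Everything here is illustrative: `overlay_signatures` is a hypothetical name, and random arrays stand in for the real (5, D) signatures of the two fields:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless rendering, no display needed
import matplotlib.pyplot as plt

def overlay_signatures(signatures, color, ax, alpha=0.02):
    """Scatter (dimension, frequency-rank, amplitude) points for each (top_k, D)
    signature at very low opacity, so only regions hit by many samples
    accumulate into a visible "core distribution"."""
    for sig in signatures:                        # sig: (top_k, D)
        k, d = sig.shape
        dims = np.tile(np.arange(d), k)           # embedding-dimension axis
        freqs = np.repeat(np.arange(k), d)        # frequency-rank axis
        amps = sig.ravel()                        # amplitude axis
        ax.scatter(dims, freqs, amps, color=color, alpha=alpha, s=4)

rng = np.random.default_rng(1)
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
# stand-ins for the Question / thinking_trajectories signature sets
question_sigs = [np.abs(rng.normal(size=(5, 8))) for _ in range(100)]
trajectory_sigs = [np.abs(rng.normal(loc=1.0, size=(5, 8))) for _ in range(100)]
overlay_signatures(question_sigs, "purple", ax)
overlay_signatures(trajectory_sigs, "blue", ax)
ax.set_xlabel("embedding dimension")
ax.set_ylabel("frequency rank")
ax.set_zlabel("amplitude")
fig.savefig("overlay.png")
```

With real signatures, systematic differences between the two fields should show up as differently shaped blue vs. purple cores.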

Why This Matters

This could lead to:

  1. A general framework/guide for identifying or preparing "reasoning-critical" data patterns
  2. Optimized fine-tuning strategies across technical domains
  3. New evaluation metrics for assessing a dataset's reasoning content

Collaboration Request

While the initial analysis shows promise, I must admit my limitations:

  • Rusty coding skills after years away from Python
  • Limited computational resources for full-scale analysis
  • Lack of recent academic context

Would your team consider:

  1. Reviewing my approach for potential improvements
  2. Helping scale this analysis with proper resources
  3. Guiding me through the academic rigor needed?

I've attached my humble attempt at implementing this pipeline. While it may not meet professional standards, I hope it conveys the core idea. I would be deeply grateful for your mentorship in refining this exploration.


Technical Notes

  • The code structure likely needs professional refactoring
  • It probably contains inefficiencies (I beg your patience)

Thank you for considering this request from an enthusiastic but rusty learner. Your work has truly inspired me to re-engage with this field.

Best regards.

Signed-off-by: SamYuan1990 <yy19902439@126.com>