Explore the parameter space to try and improve results. #11

@bcipolli

To play, run:

    python main.py --csv-file raw_dataframe.csv

Then add flags to explore the parameter space:

  • --source-thresh SOURCE_THRESH Min % of events a news source must cover, to be included.
    Default 0.5; lowering this would include a broader set of news sources.

  • --min-article-length MIN_ARTICLE_LENGTH Min # words in an article (pre-parsing)
    Set to 250. Are longer articles more biased?

  • --min-vocab-length MIN_VOCAB_LENGTH Min # words in an article (post-lemmatizing, vectorizing)
    Set to 100. Are longer articles more biased?

  • --lda-min-appearances LDA_MIN_APPEARANCES Min # appearances of a word, to be included in the vocabulary
    Set to 2. Could raise this, to focus on the most common words.

  • --lda-vectorization-type {count,tfidf} Type of vectorization of article to word counts, to do.
    Set to count. Not 100% sure tfidf is working, but if it is, we should use it.

  • --lda-groupby {source,article} Run LDA on text separated by article, or by news source?
    Set to article right now. This just means: what are the "documents" (sets of words) sent into LDA? Could be by article, or could aggregate over source.

  • --lda-topics LDA_TOPICS # of LDA topics
    Set to 10. The clusters suggest that a higher number could be helpful.

  • --lda-iters LDA_ITERS # of LDA iterations
    Set to 1500. Could probably be lowered for larger datasets.

  • --truth-frequency-thresh TRUTH_FREQUENCY_THRESH % of articles in a news event that must mention a word, for it to be "truth" / removed.
    Set to 0.5. Could be higher (e.g. 1.1, to force no words to be removed) or lower (e.g. 0.1, to remove most words and leave only infrequent words for bias). Could also be implemented as a range, to say: bias words appear often, but not as often as truth words, and not as infrequently as random garbage.
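
One way to explore several of these flags at once is to generate a command line per combination of values. This is only a rough sketch: the flag names and the base command come from the list above, but the grid values (and the helper name `build_commands`) are illustrative guesses, not settings anyone has tested. It only prints the commands; it does not run main.py.

```python
import itertools
import shlex

# Base command taken from the issue text.
BASE_CMD = ["python", "main.py", "--csv-file", "raw_dataframe.csv"]

# Hypothetical grid: each flag maps to the values to try.
# The values are illustrative, loosely based on the suggestions in the list.
PARAM_GRID = {
    "--lda-topics": [10, 20],
    "--truth-frequency-thresh": [0.1, 0.5, 1.1],
    "--lda-groupby": ["article", "source"],
}


def build_commands(base_cmd, grid):
    """Return one argument list per combination of flag values in the grid."""
    flags = list(grid)
    commands = []
    for values in itertools.product(*(grid[f] for f in flags)):
        cmd = list(base_cmd)
        for flag, value in zip(flags, values):
            cmd += [flag, str(value)]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    for cmd in build_commands(BASE_CMD, PARAM_GRID):
        # Print rather than run; swap in subprocess.run(cmd, check=True) to execute.
        print(shlex.join(cmd))
```

With the grid above this yields 2 x 3 x 2 = 12 runs. With --lda-iters at 1500 each run may be slow, so it is probably worth keeping the grid small or lowering the iteration count while sweeping.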
