caleb-stahl/cs190n_Team1_Final_Project

Project Overview

This repository contains the implementation and evaluation of a predictive model for video chunk download times. The project leverages datasets from Stanford's Puffer Project and Twitch live streams to highlight challenges in model generalization.


Prerequisites

  • Python: Version 3.8+
  • Required Libraries: numpy, pandas, scikit-learn, netunicorn
  • Required Tools: tshark (the command-line packet analyzer from the Wireshark project)
  • Datasets:
    • Access to the Puffer Project dataset.
    • Tools like PINOT and netUnicorn for Twitch data collection.

Setup Instructions

1. Clone the Repository

git clone https://github.com/caleb-stahl/cs190n_Team1_Final_Project.git

2. Install Dependencies

pip install -r requirements.txt

Data Preparation

Puffer Project Data

  1. Download the dataset from the Puffer Project website. To use the same data we used, download the December 4th, 2024 data.
  2. Place the dataset in the data/puffer directory.

Twitch Data

  1. Use PINOT and netUnicorn to capture streaming data (or use the data we captured). Alternatively, you can use a tool like Wireshark or tshark to get packet captures.
  2. Preprocess the data you captured with the data_preprocessing.ipynb file (see below).
  3. Place the packet captures in the data/twitch directory.

Usage

1. Replicate Data Collection

  • To replicate data collection, you need either access to a network with PINOT and netUnicorn set up, or Wireshark/tshark installed on your computer.
    • In the former case, open and read through the data_collection.ipynb notebook for guided instructions on how to define and run your own netUnicorn experiment similar to ours.
    • In the latter case, follow a Wireshark or tshark capture tutorial, and save the pcap after you collect it.
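
For the tshark route, a capture can be scripted rather than typed by hand. The sketch below is illustrative and not part of this repository: the helper name build_capture_command, the interface "eth0", and the 60-second duration are assumptions. It only builds the command and checks whether tshark is on the PATH, because running a capture usually needs elevated privileges.

```python
import shutil


def build_capture_command(interface, output_pcap, duration_s):
    """Build a tshark command that captures on one interface for a fixed time.

    -i selects the interface, -w writes raw packets to a pcap file, and
    -a duration:N stops the capture automatically after N seconds.
    """
    return [
        "tshark",
        "-i", interface,
        "-w", output_pcap,
        "-a", f"duration:{duration_s}",
    ]


# "eth0" and 60 seconds are placeholder values; substitute your own.
cmd = build_capture_command("eth0", "file.pcap", 60)

if shutil.which("tshark") is None:
    print("tshark was not found on PATH; install Wireshark/tshark first.")
else:
    # To actually capture, pass cmd to subprocess.run(cmd, check=True).
    print("Capture command:", " ".join(cmd))
```

Start the Twitch stream before launching the capture so that the pcap contains video traffic for the whole window.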

2. Preprocess the Data

After collecting the data, the next step is to preprocess it into a format similar to the Puffer dataset.

  • First, run the following command with tshark (which you should already have installed from an earlier step):
tshark -r file.pcap -T fields -E separator=/t -e frame.time_epoch -e ip.src -e tcp.srcport -e udp.srcport -e ip.dst -e tcp.dstport -e udp.dstport -e ip.len -e ip.hdr_len -e ip.proto -e tcp.flags -e tcp.seq_raw -e tcp.ack_raw -e tcp.hdr_len -e udp.length -e tcp.analysis.retransmission -e tcp.analysis.ack_rtt -e tcp.seq -e tcp.ack >> file.csv
  • This command extracts the necessary features from the PCAP, which we can further shape to fit the structure of the Puffer dataset. These are our chosen features and target variable:
    • Target Variable - Download Duration: The variable we are trying to predict (seconds).
    • Size: The size of the video chunk (bytes).
    • RTT: Time it takes for a data packet to travel from the sender to the client and back (nanoseconds).
    • Throughput: Rate at which data is transmitted over a network (Mbps).
    • Bytes Per Transmission Time: Size of the chunk / RTT of the chunk.
    • In Flight: Number of dropped or missing packets.
  • Second, process the CSV that tshark created by following the instructions in the Markdown cells of data_processing.ipynb. This produces another CSV file containing the features above.
  • Third and finally, place this CSV file in the data/test_data/ folder of this repository.
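
To inspect the tshark output outside the notebook, the sketch below loads it with pandas. It is a minimal illustration, not the notebook's actual code. The column names mirror the -e flags in the command above, in the same order. The helpers throughput_mbps and bytes_per_rtt are hypothetical, and the throughput formula (bits transferred divided by duration) is an assumption about how that feature could be derived.

```python
import io

import pandas as pd

# Field names in the same order as the -e flags in the tshark command.
TSHARK_FIELDS = [
    "frame.time_epoch", "ip.src", "tcp.srcport", "udp.srcport",
    "ip.dst", "tcp.dstport", "udp.dstport", "ip.len", "ip.hdr_len",
    "ip.proto", "tcp.flags", "tcp.seq_raw", "tcp.ack_raw", "tcp.hdr_len",
    "udp.length", "tcp.analysis.retransmission", "tcp.analysis.ack_rtt",
    "tcp.seq", "tcp.ack",
]


def load_tshark_fields(source):
    """Read tab-separated tshark output (a path or file-like object)."""
    # The file has no header row, so the column names are supplied here.
    return pd.read_csv(source, sep="\t", names=TSHARK_FIELDS, header=None)


def throughput_mbps(size_bytes, duration_s):
    """Assumed definition: bits transferred per second, in megabits."""
    return size_bytes * 8 / duration_s / 1e6


def bytes_per_rtt(size_bytes, rtt):
    """Chunk size divided by chunk RTT, as described in the feature list."""
    return size_bytes / rtt


# Made-up single packet row, used only to show the expected layout.
sample_row = "\t".join([
    "1733300000.0", "10.0.0.1", "443", "", "10.0.0.2", "50000", "",
    "1500", "20", "6", "0x0010", "100", "200", "32", "", "", "0.012",
    "1", "1",
]) + "\n"
packets = load_tshark_fields(io.StringIO(sample_row))
print(packets[["frame.time_epoch", "ip.len", "tcp.analysis.ack_rtt"]])
```

The file is tab-separated despite its .csv name, hence sep="\t". Because the command appends with >>, rerunning it against the same file.csv adds duplicate rows, so delete the file before re-extracting.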

3. Train/Evaluate the Model

  • Update the file paths in puffer_rf.ipynb to point to the Puffer data you downloaded.
  • Follow the instructions at the end of puffer_rf.ipynb to update the file path for your collected data.
  • Read through puffer_rf.ipynb and run the cells to train the Random Forest Regressor and build a decision tree for analyzing the model.
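
For readers who want the shape of this step before opening the notebook, here is a minimal scikit-learn sketch. It is not the code in puffer_rf.ipynb. The data is synthetic and generated on the spot, the target formula is invented for illustration, and the column names and hyperparameters (50 trees, depth-3 tree) are assumptions. It fits a Random Forest Regressor on the five features, scores it on held-out rows, and fits a shallow decision tree whose rules can be printed for inspection.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, export_text

# Hypothetical column names for the features described in this README.
FEATURES = ["size", "rtt", "throughput", "bytes_per_rtt", "in_flight"]


def make_synthetic_chunks(n_rows, seed):
    """Generate fake chunk records; stands in for the real datasets."""
    rng = np.random.default_rng(seed)
    size = rng.uniform(1e5, 2e6, n_rows)          # bytes
    rtt = rng.uniform(0.01, 0.2, n_rows)          # seconds, for simplicity
    throughput = rng.uniform(1.0, 50.0, n_rows)   # Mbps
    in_flight = rng.integers(0, 50, n_rows)
    X = np.column_stack([size, rtt, throughput, size / rtt, in_flight])
    # Invented target: transfer time plus one RTT plus a little noise.
    y = size * 8 / (throughput * 1e6) + rtt + rng.normal(0, 0.01, n_rows)
    return X, y


X, y = make_synthetic_chunks(500, seed=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Random Forest Regressor predicting download duration in seconds.
forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)
mae = mean_absolute_error(y_test, forest.predict(X_test))

# A single shallow tree is easier to read than the whole forest.
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(X_train, y_train)
tree_rules = export_text(tree, feature_names=FEATURES)

print(f"Held-out MAE: {mae:.3f} s")
print(tree_rules)
```

To probe generalization in the same way, fit on rows from one source and score on rows from the other instead of using a random split.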

Results

See the results folder for a copy of puffer_rf.ipynb with our results included in the cell outputs. These results are analyzed further in our research paper.


About

Final Project by Team 1 (Caleb Stahl and Olivia Jonokuchi) for CS190n.
