
Hi, don't use pandas for looping, try this. #1

@MarceloBarrosVanegas

Description


Hi, I found your repo very helpful. I just have some minor suggestions.

First, the rainfall data you start with is in an unconventional layout, so I propose a simple pandas-based function that takes a (date, value) long-format CSV and converts it to the matrix format.

This assumes the rainfall data has hourly or finer granularity (it is resampled to hourly sums).

import pandas as pd
import polars as pl


def long_to_matrix_format(csv_path):
    """
    Convert a long-format CSV with 'date' and 'value' columns
    to a matrix format with 'Year', 'Day', and 24 hourly columns.

    Parameters:
    - csv_path: str, path to the long format CSV

    Returns:
    - matrix_df: DataFrame in matrix format
    """

    # Read the CSV with polars (fast parser), then move to pandas
    df = pl.read_csv(csv_path, try_parse_dates=True).to_pandas()

    # Make sure the 'date' column is a pandas datetime
    df["date"] = pd.to_datetime(df["date"])

    # Set the date column as the index and resample to hourly sums
    df.set_index("date", inplace=True)
    df = df.resample("60min").sum()
    df["date"] = df.index.to_series()


    # Extract year, day of year, and hour
    df["Year"] = df["date"].dt.year
    df["Day"] = df["date"].dt.dayofyear
    df["Hour"] = (
        df["date"].dt.hour + 1
    )  # +1 to start hours from 1 to 24 instead of 0 to 23

    # Pivot the DataFrame to the desired matrix format
    matrix_df = df.pivot_table(
        index=["Year", "Day"], columns="Hour", values="value", aggfunc="sum"
    ).reset_index()

    # Rearrange columns to the desired order
    matrix_df = matrix_df[["Year", "Day"] + [i for i in range(1, 25)]]
    matrix_df.columns = ["Year", "Day"] + [f"Hour {i}" for i in range(1, 25)]

    # Fill NaN values with 0
    matrix_df.fillna(0, inplace=True)

    matrix_df["Year"] = matrix_df["Year"] - matrix_df["Year"].min() + 1

    # Return a contiguous float64 NumPy array; this is important for the numba code below.
    return matrix_df.to_numpy().astype(float).copy()
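
For reference, a minimal usage sketch (the file name "rainfall_long.csv" is just a placeholder for your own long-format CSV):

matrix = long_to_matrix_format("rainfall_long.csv")
print(matrix.shape)  # (number of (Year, Day) rows, 26): Year, Day, Hour 1..24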

#-----------------------------------------

Second, the for loop you use to build the data1 DataFrame is very slow, because you are calling pandas vectorized functions on an element-by-element (i, j) basis. My suggestion: write it in plain Python/NumPy and compile it with numba.

import numba
import numpy as np
from numba.pycc import CC

cc = CC('nb_idf')  # AOT module name is arbitrary; only needed for the cc.export line


@numba.njit('float64[:,::1](float64[:,::1])')
@cc.export('nb_get_idf_rolling_sum', 'float64[:,::1](float64[:,::1])')
def get_idf_rolling_sum(df_array):
    rows = np.unique(df_array[:, 0]).size
    data = np.zeros((rows, 24))

    for i in range(rows):
        # Filter by year and drop the 'Year' and 'Day' columns
        df1_array = df_array[df_array[:, 0] == i + 1][:, 2:].T

        # Loop through 0 to 23 to calculate the max rolling sum for each window size
        for j in range(24):
            max_rolling_sum = -np.inf  # running maximum over all windows of size j + 1

            # Loop through each column
            for col in range(df1_array.shape[1]):
                # Loop through each possible window in the current column
                for row in range(df1_array.shape[0] - j):
                    rolling_sum = np.sum(df1_array[row : row + j + 1, col])

                    # Update max rolling sum if this sum is greater
                    if rolling_sum > max_rolling_sum:
                        max_rolling_sum = rolling_sum

            # Store the max rolling sum in the data array
            data[i, j] = max_rolling_sum

    return data
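
Putting the two pieces together looks roughly like this (again, the CSV path is a placeholder):

matrix = long_to_matrix_format("rainfall_long.csv")
data = get_idf_rolling_sum(matrix)
# data[i, j] = maximum rainfall accumulated over any (j + 1)-hour window in year i + 1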

The numba function replaces this part.
for i in range(99):
    df1 = (((df.where(df['Year']==i+1)).dropna()).drop(['Year','Day'],axis=1)).T
    for j in range(24):
        data[i][j] = max((df1.rolling(j+1).sum()).max())

#-------------------
This is a timing test on a 10 million row CSV file in the long format, i.e.:

date,value
1981-01-01 01:00:00, 2.5
1981-01-01 02:00:00, 2.5
1981-01-01 03:00:00, 2.5
1981-01-01 04:00:00, 2.5
...
(up to the 10 millionth row)

# convert long format to matrix format
t0 = 0.5562264919281006

# process rolling sum
t1 = 0.024176836013793945
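
A minimal sketch for timing the two steps with time.perf_counter (the CSV path is again just a placeholder):

import time

start = time.perf_counter()
matrix = long_to_matrix_format("rainfall_long.csv")
t0 = time.perf_counter() - start

start = time.perf_counter()
data = get_idf_rolling_sum(matrix)
t1 = time.perf_counter() - start

print(t0, t1)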

These two functions use polars, numpy, numba, and pandas.

It would be nice if you turned this into a Python package, maybe adding a couple of features; there are not many Python packages for IDF curve construction.

Have a nice day,
Marcelo.
