
Hi, don't use pandas for looping, try this. #1

@MarceloBarrosVanegas

Description


Hi, I found your repo very helpful. I just have some minor suggestions.

First, the rainfall data you start with is in an unconventional layout, so I propose a simple pandas-based function that takes a (date, value) long-format CSV and converts it to the matrix format.

This assumes the rainfall data has hourly or finer granularity (it is resampled to hourly sums).

import pandas as pd
import polars as pl


def long_to_matrix_format(csv_path):
    """
    Convert a long-format CSV with 'date' and 'value' columns
    to a matrix format with 'Year', 'Day', and 24 hourly columns.

    Parameters:
    - csv_path: str, path to the long format CSV

    Returns:
    - matrix_df: DataFrame in matrix format
    """

    # Read the CSV with polars (fast parser), then move to pandas
    df = pl.read_csv(csv_path, try_parse_dates=True).to_pandas()

    # Make sure the 'date' column is a pandas datetime
    df["date"] = pd.to_datetime(df["date"])

    # Set the date column as the index and resample to hourly sums
    df.set_index("date", inplace=True)
    df = df.resample("60min").sum()
    df["date"] = df.index.to_series()


    # Extract year, day of year, and hour
    df["Year"] = df["date"].dt.year
    df["Day"] = df["date"].dt.dayofyear
    df["Hour"] = (
        df["date"].dt.hour + 1
    )  # +1 to start hours from 1 to 24 instead of 0 to 23

    # Pivot the DataFrame to the desired matrix format
    matrix_df = df.pivot_table(
        index=["Year", "Day"], columns="Hour", values="value", aggfunc="sum"
    ).reset_index()

    # Rearrange columns to the desired order
    matrix_df = matrix_df[["Year", "Day"] + [i for i in range(1, 25)]]
    matrix_df.columns = ["Year", "Day"] + [f"Hour {i}" for i in range(1, 25)]

    # Fill NaN values with 0
    matrix_df.fillna(0, inplace=True)

    matrix_df["Year"] = matrix_df["Year"] - matrix_df["Year"].min() + 1

    # Return a contiguous float64 NumPy array; this is important for the numba code below.
    return matrix_df.to_numpy().astype(float).copy()
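
For reference, a minimal usage sketch (the file name "rainfall_long.csv" is just a placeholder for your own long-format CSV):

matrix = long_to_matrix_format("rainfall_long.csv")
print(matrix.shape)  # (number of (Year, Day) rows, 26): Year, Day, Hour 1..24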

#-----------------------------------------

Second, the for loop you use to build the data1 DataFrame is very slow, because you are calling pandas vectorized functions on an element-by-element (i, j) basis. My suggestion: write it in plain Python/NumPy and compile it with numba.

import numba
import numpy as np
from numba.pycc import CC

cc = CC('nb_idf')  # AOT module name is arbitrary; only needed for the cc.export line


@numba.njit('float64[:,::1](float64[:,::1])')
@cc.export('nb_get_idf_rolling_sum', 'float64[:,::1](float64[:,::1])')
def get_idf_rolling_sum(df_array):
    rows = np.unique(df_array[:, 0]).size
    data = np.zeros((rows, 24))

    for i in range(rows):
        # Filter by year and drop the 'Year' and 'Day' columns
        df1_array = df_array[df_array[:, 0] == i + 1][:, 2:].T

        # Loop through 0 to 23 to calculate the max rolling sum for each window size
        for j in range(24):
            max_rolling_sum = -np.inf  # running maximum over all windows of size j + 1

            # Loop through each column
            for col in range(df1_array.shape[1]):
                # Loop through each possible window in the current column
                for row in range(df1_array.shape[0] - j):
                    rolling_sum = np.sum(df1_array[row : row + j + 1, col])

                    # Update max rolling sum if this sum is greater
                    if rolling_sum > max_rolling_sum:
                        max_rolling_sum = rolling_sum

            # Store the max rolling sum in the data array
            data[i, j] = max_rolling_sum

    return data
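
Putting the two pieces together looks roughly like this (again, the CSV path is a placeholder):

matrix = long_to_matrix_format("rainfall_long.csv")
data = get_idf_rolling_sum(matrix)
# data[i, j] = maximum rainfall accumulated over any (j + 1)-hour window in year i + 1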

The numba function replaces this part.
for i in range(99):
    df1 = (((df.where(df['Year']==i+1)).dropna()).drop(['Year','Day'],axis=1)).T
    for j in range(24):
        data[i][j] = max((df1.rolling(j+1).sum()).max())

#-------------------
This is a timing test on a 10 million row CSV file in the long format, i.e.:

date,value
1981-01-01 01:00:00, 2.5
1981-01-01 02:00:00, 2.5
1981-01-01 03:00:00, 2.5
1981-01-01 04:00:00, 2.5
...
(up to the 10 millionth row)

# convert long format to matrix format
t0 = 0.5562264919281006

# process rolling sum
t1 = 0.024176836013793945
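
A minimal sketch for timing the two steps with time.perf_counter (the CSV path is again just a placeholder):

import time

start = time.perf_counter()
matrix = long_to_matrix_format("rainfall_long.csv")
t0 = time.perf_counter() - start

start = time.perf_counter()
data = get_idf_rolling_sum(matrix)
t1 = time.perf_counter() - start

print(t0, t1)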

These two functions use polars, numpy, numba, and pandas.

It would be nice if you turned this into a Python package, maybe adding a couple of features; there are not many Python packages for IDF curve construction.

Have a nice day,
Marcelo.
