Speed up `PandasDataset` for long dataframes #2435
Conversation
I'm using the following example to evaluate speed (iterating twice, since index checks are disabled after the first iteration):

```python
from pathlib import Path
from time import time

import pandas as pd
from tqdm import tqdm

from gluonts.dataset.pandas import PandasDataset

df = pd.read_parquet(Path(__file__).resolve().parent / "long_df_sample.parquet")

t0 = time()
ds = PandasDataset.from_long_dataframe(
    dataframe=df,
    item_id="item_id",
    timestamp="timestamp",
    freq="M",
)
t1 = time()
print(f"construction time: {t1 - t0}")

N = 2
t0 = time()
for _ in range(N):
    for entry in tqdm(ds):
        pass
t1 = time()
print(f"average iteration time: {(t1 - t0)/N}")
```

Before this PR:
After this PR:
So construction is much faster (not as fast as in #2377, since there the …).

If I cache the dataset at construction time with …, … which is anyway much faster than the original code.
src/gluonts/dataset/pandas.py (outdated)

```python
if isinstance(data, dict):
    return data.items()
if isinstance(data, (Iterable, Collection)):
    first_element = first(data)
```
You might want to use `toolz.peek` here.
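A minimal sketch of what the suggestion could look like; the helper name is illustrative, not from the PR:

```python
# Illustrative sketch only: toolz.peek returns the first element together with
# an iterator that still yields it, so a one-shot iterable is not partially
# consumed by the inspection step.
from toolz import peek

def inspect_first(data):
    first_element, data = peek(data)
    # ... dispatch on type(first_element), then iterate over `data` in full ...
    return first_element, data
```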
Isn't it pretty much the same for collections or iterables?
I guess we rely on the re-iterable aspect?
Yes
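For illustration (not code from the PR), the difference the re-iterability point hinges on:

```python
# Illustrative only: `first` is harmless on a re-iterable collection, but it
# consumes the leading element of a one-shot generator; `peek` hands back an
# equivalent iterator so nothing is lost.
from toolz import first, peek

items = [1, 2, 3]
assert first(items) == 1 and list(items) == [1, 2, 3]  # collection: unchanged

gen = (x for x in [1, 2, 3])
assert first(gen) == 1 and list(gen) == [2, 3]         # generator: element consumed

gen = (x for x in [1, 2, 3])
head, gen = peek(gen)
assert head == 1 and list(gen) == [1, 2, 3]            # peek preserves it
```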
src/gluonts/dataset/pandas.py (outdated)

```python
    self._dataframes = [(None, df) for df in self.dataframes]
else:  # case single dataframe
    self._dataframes = [(None, self.dataframes)]
self._pairs: Any = self.dataframes.items()
```
Anything that is assigned to should be listed as a definition.
Not sure I get it.
There should be a type annotation at the class level.
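A minimal sketch of what a class-level declaration could look like; the field type here is an assumption based on the diff above, not the PR's final annotation:

```python
# Sketch, not the actual gluonts code: declaring the attribute (with its type)
# at class level makes the assignment in __init__ a use of a known definition.
from typing import Any, Iterable, Tuple

import pandas as pd

class PandasDataset:
    # class-level type declaration for the attribute assigned below
    _pairs: Iterable[Tuple[Any, pd.DataFrame]]

    def __init__(self, dataframes: dict) -> None:
        self._pairs = dataframes.items()
```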
src/gluonts/dataset/pandas.py (outdated)

```python
    series.index, DatetimeIndexOpsMixin
), "series index has to be a DatetimeIndex."
return series.to_frame(name="target")

def pair_with_item_id(obj):
```
Type annotation is missing. I think this should use a `NamedTuple`. I think it's generally bad style to rely on some implied meaning through ordering.
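A rough sketch of the `NamedTuple` idea; the class and field names are illustrative, not taken from the PR:

```python
# Sketch only: an explicit NamedTuple makes the (item_id, dataframe) pairing
# self-describing instead of relying on tuple position.
from typing import NamedTuple, Optional

import pandas as pd

class ItemEntry(NamedTuple):
    item_id: Optional[str]
    dataframe: pd.DataFrame

def pair_with_item_id(obj: pd.DataFrame, item_id: Optional[str] = None) -> ItemEntry:
    return ItemEntry(item_id=item_id, dataframe=obj)
```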
src/gluonts/dataset/pandas.py (outdated)

```python
if self.timestamp:
    df = df.set_index(keys=self.timestamp)
if isinstance(self.dataframes, (pd.Series, pd.DataFrame)):
    self._dataframes = [self.dataframes]
```
Why have one with and one without underscore?
Long story short: some test (`test_hts_to_dataset`) is accessing the `dataframes` attribute and was breaking with my change, and I'm trying to avoid touching tests at all (in the sense of modifying existing assertions). But I can also do that.
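For illustration, one way the two attributes can coexist (a sketch under the assumption described above, not the PR's exact code):

```python
# Sketch only: keep the user-supplied input under the public name so existing
# tests reading `dataframes` keep working, and store the normalized form
# under the private name used internally.
from typing import List, Union

import pandas as pd

class PandasDataset:
    def __init__(self, dataframes: Union[pd.DataFrame, List[pd.DataFrame]]) -> None:
        self.dataframes = dataframes           # public: exactly what the caller passed
        if isinstance(dataframes, pd.DataFrame):
            self._dataframes = [dataframes]    # private: normalized to a list
        else:
            self._dataframes = list(dataframes)
```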
""" | ||
|
||
dataframes: Union[ | ||
pd.DataFrame, | ||
pd.Series, | ||
List[pd.DataFrame], |
I think we can make use of TypeVar here: https://stackoverflow.com/questions/59933946/difference-between-typevart-a-b-and-typevart-bound-uniona-b
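A rough sketch of how the TypeVar from the linked answer could be applied; the function name and signature are illustrative only:

```python
# Sketch only: a bound TypeVar keeps the concrete input type (DataFrame vs.
# Series) in the result type instead of widening everything to the full Union.
from typing import List, TypeVar, Union

import pandas as pd

T = TypeVar("T", bound=Union[pd.DataFrame, pd.Series])

def as_list(data: Union[T, List[T]]) -> List[T]:
    if isinstance(data, (pd.DataFrame, pd.Series)):
        return [data]
    return list(data)
```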
Do we know what impact on memory consumption this change has?
* Fix rotbaum random seed and num_samples argument. (#2408)
* Hierarchical: Make sure the input S matrix is of right dtype (#2415)
* Speed up `PandasDataset` for long dataframes (#2435)
* Fix frequency inference in `PandasDataset` (#2442)
* Mypy fixes (#2427): Fix new mypy complaints. Also remove ininspection comments.
* Tests: Change Python versions. (#2448) Remove Python 3.6, instead test up to 3.10.

Co-authored-by: Sigrid Passano Hellan <[email protected]>
Co-authored-by: Syama Sundar Rangapuram <[email protected]>
Co-authored-by: Jasper <[email protected]>
Issue #, if available: #2363
Description of changes: This PR is similar in spirit to #2377, but without adding a dedicated class. Also strengthened tests a bit.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Please tag this PR with at least one of these labels to make our release process faster: BREAKING, new feature, bug fix, other change, dev setup