
Speed up PandasDataset for long dataframes #2435

Merged: 16 commits into awslabs:dev from fix-2363 on Nov 16, 2022
Conversation

@lostella (Contributor) commented on Nov 11, 2022

Issue #, if available: #2363

Description of changes: This PR is similar in spirit to #2377, but without adding a dedicated class. It also strengthens the tests a bit.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Please tag this PR with at least one of these labels to make our release process faster: BREAKING, new feature, bug fix, other change, dev setup

@lostella requested a review from @jaheba on November 11, 2022 at 12:29
@lostella (Contributor, Author) commented on Nov 11, 2022

I'm using the following example to evaluate speed (iterating twice, since index checks are disabled after the first iteration):

from pathlib import Path
from time import time

import pandas as pd
from tqdm import tqdm

from gluonts.dataset.pandas import PandasDataset

df = pd.read_parquet(Path(__file__).resolve().parent / "long_df_sample.parquet")

# Time the dataset construction
t0 = time()
ds = PandasDataset.from_long_dataframe(
    dataframe=df,
    item_id="item_id",
    timestamp="timestamp",
    freq="M",
)
t1 = time()
print(f"construction time: {t1 - t0}")

# Time N full passes over the dataset and report the average
N = 2

t0 = time()
for _ in range(N):
    for entry in tqdm(ds):
        pass
t1 = time()
print(f"average iteration time: {(t1 - t0) / N}")

Before this PR:

construction time: 32.1102340221405
100%|███████████████████████████████████████████████████████████████████████████| 25000/25000 [00:04<00:00, 6237.54it/s]
100%|██████████████████████████████████████████████████████████████████████████| 25000/25000 [00:01<00:00, 17872.74it/s]

After this PR:

construction time: 0.44616174697875977
100%|███████████████████████████████████████████████████████████████████████████| 25000/25000 [00:11<00:00, 2164.93it/s]
100%|███████████████████████████████████████████████████████████████████████████| 25000/25000 [00:08<00:00, 2803.81it/s]

So construction is much faster (though not as fast as in #2377, where the groupby operation is delayed), but iteration is slower. This is because the original code essentially pre-processes and caches all dataframes at construction time. With this PR, one can get the same behavior by explicitly caching the dataset with list, in case dataset iteration is a bottleneck; see the sketch below.
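
A minimal sketch of that workaround, reusing the names from the benchmark above (cached is a hypothetical variable name):

# Materialize all entries once, trading memory for iteration speed;
# later passes iterate over a plain Python list.
cached = list(ds)

t0 = time()
for _ in range(N):
    for entry in tqdm(cached):
        pass
t1 = time()
print(f"average iteration time (cached): {(t1 - t0) / N}")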

@lostella (Contributor, Author) commented on Nov 11, 2022

If I cache the dataset at construction time with ds = list(ds), then with the PR code I get:

construction + caching time: 11.770990133285522
100%|████████████████████████████████████████████████████████████████████████| 25000/25000 [00:00<00:00, 2887525.47it/s]
100%|████████████████████████████████████████████████████████████████████████| 25000/25000 [00:00<00:00, 2800833.38it/s]

which is still much faster overall than the original code.

Review comment on:

    if isinstance(data, dict):
        return data.items()
    if isinstance(data, (Iterable, Collection)):
        first_element = first(data)
Contributor: You might want to use toolz.peek here.

Contributor (Author): Isn't it pretty much the same for collections or iterables?

Contributor: I guess we rely on the re-iterable aspect?

Contributor (Author): Yes.
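
For context (not part of the diff), the difference only matters for one-shot iterators; here is a minimal sketch of the two toolz helpers:

from toolz import first, peek

# first() consumes an element from a one-shot iterator:
gen = (x for x in range(3))
assert first(gen) == 0
assert list(gen) == [1, 2]  # the first element is gone

# peek() returns the element together with an equivalent iterator
# that still yields it:
head, seq = peek(x for x in range(3))
assert head == 0
assert list(seq) == [0, 1, 2]

# For re-iterable collections, first() loses nothing:
data = [10, 20, 30]
assert first(data) == 10
assert data == [10, 20, 30]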

Review comment on:

        self._dataframes = [(None, df) for df in self.dataframes]
    else:  # case single dataframe
        self._dataframes = [(None, self.dataframes)]
    self._pairs: Any = self.dataframes.items()
Contributor: Anything that is assigned to should be listed as a definition.

Contributor (Author): Not sure I get it.

Contributor: There should be a type annotation at the class level.
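
A hypothetical sketch of what that could look like (the field type is an assumption; the final design may differ):

from dataclasses import dataclass, field
from typing import List, Optional, Tuple

import pandas as pd

@dataclass
class PandasDataset:
    # ... public fields elided ...

    # Declaring the derived attribute at class level makes it part of
    # the class definition, so the later assignment in __post_init__ is
    # an ordinary assignment rather than a new, locally annotated name.
    _pairs: List[Tuple[Optional[str], pd.DataFrame]] = field(init=False)

    def __post_init__(self) -> None:
        self._pairs = []  # populated from `dataframes` in the real code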

Review comment on:

        series.index, DatetimeIndexOpsMixin
    ), "series index has to be a DatetimeIndex."
    return series.to_frame(name="target")

def pair_with_item_id(obj):
Contributor: Type annotation is missing.

I think this should use a NamedTuple. I think it's generally bad style to rely on some implied meaning through ordering.
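
One way to address both points, with a hypothetical ItemEntry type (the name and exact signature are assumptions):

from typing import NamedTuple, Optional, Union

import pandas as pd

class ItemEntry(NamedTuple):
    # Named fields make the (item_id, data) pairing explicit,
    # instead of relying on tuple ordering.
    item_id: Optional[str]
    data: Union[pd.Series, pd.DataFrame]

def pair_with_item_id(
    obj: Union[tuple, pd.Series, pd.DataFrame]
) -> ItemEntry:
    # Tuples are assumed to already be (item_id, data) pairs;
    # bare Series/DataFrames get no item_id.
    if isinstance(obj, tuple):
        return ItemEntry(*obj)
    return ItemEntry(None, obj)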

Review comment on:

    if self.timestamp:
        df = df.set_index(keys=self.timestamp)
    if isinstance(self.dataframes, (pd.Series, pd.DataFrame)):
        self._dataframes = [self.dataframes]
Contributor: Why have one with and one without underscore?

Contributor (Author, Nov 11, 2022): Long story short: a test (test_hts_to_dataset) accesses the dataframes attribute and was breaking with my change, and I'm trying to avoid touching tests at all (in the sense of modifying existing assertions). But I can also do that.

"""

dataframes: Union[
pd.DataFrame,
pd.Series,
List[pd.DataFrame],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lostella added the labels bug fix (one of PR required labels) and pending v0.11.x backport (this contains a fix to be backported to the v0.11.x branch) on Nov 12, 2022
@jaheba (Contributor) commented on Nov 16, 2022

Do we know what impact this change has on memory consumption?

@lostella (Contributor, Author) replied:

> Do we know what impact this change has on memory consumption?

I used scalene with Python 3.8.13, on dev vs this PR, using the data from #2363. Max memory consumption is as follows:

  • dev: 575.1 MB
  • this PR: 315.0 MB

@lostella merged commit d6aec0a into awslabs:dev on Nov 16, 2022
@lostella deleted the fix-2363 branch on November 16, 2022 at 12:27
@lostella added a commit to lostella/gluonts that referenced this pull request on Nov 21, 2022
@lostella mentioned this pull request on Nov 21, 2022
@lostella added a commit that referenced this pull request on Nov 21, 2022:
* Fix rotbaum random seed and num_samples argument. (#2408)

* Hierarchical: Make sure the input S matrix is of right dtype (#2415)

* Speed up `PandasDataset` for long dataframes (#2435)

* Fix frequency inference in `PandasDataset` (#2442)

* Mypy fixes (#2427)

* Fix new mypy complaints.

* Also remove noinspection comments.

* Tests: Change Python versions. (#2448)

Remove Python 3.6; instead, test up to 3.10.

Co-authored-by: Sigrid Passano Hellan <[email protected]>
Co-authored-by: Syama Sundar Rangapuram <[email protected]>
Co-authored-by: Jasper <[email protected]>
@lostella removed the pending v0.11.x backport label (this contains a fix to be backported to the v0.11.x branch) on Nov 24, 2022