Triangle.dropna() refinement #1036

@genedan

Description

Triangle.dropna() could take additional parameters to fine-tune what it drops. It currently drops an origin or development period only when all of its values are null; when both an origin and a development period are entirely null, it drops both. The current implementation leaves a few use cases unaddressed:

  1. Dropping periods when only some of the values are NaNs
  2. Dropping only one of the axes at a time

The following example illustrates non-intuitive behavior a user may encounter when trying to drop an origin period with a NaN.

import chainladder as cl
import numpy as np

tri = cl.Triangle(
    data={
        'origin': [1985, 1985, 1985, 1986, 1986, 1987],
        'development': [1985, 1986, 1987, 1986, 1987, 1987],
        'paid': [500, np.nan, 700, 500, 600, 500],
    },
    origin='origin',
    development='development',
    columns=['paid'],
    cumulative=True
)
print(tri)
print(tri.dropna())

         12     24     36
1985  500.0    NaN  700.0
1986  500.0  600.0    NaN
1987  500.0    NaN    NaN
         12     24     36
1985  500.0    NaN  700.0
1986  500.0  600.0    NaN
1987  500.0    NaN    NaN

The docstring does say that the entire row needs to be NaN to be dropped, but a user coming from Pandas might find this result surprising until they read the fine print. I also think it's reasonable to expect an origin period that contains a NaN, such as 1985 here, to be dropped.

I'm okay with this function not mirroring Pandas 1-1. I think some of the current departures are sensible, like not dropping periods in the middle of a triangle if they are all NaN.
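To make the current rule concrete, here is a minimal pure-Python sketch of "drop all-NaN periods, but only at the edges". It is not chainladder's implementation; the helper name trim_all_nan_edges and the plain list-of-rows layout are assumptions made for illustration.

```python
import math


def trim_all_nan_edges(rows):
    """Hypothetical helper: drop leading and trailing rows that are all NaN.

    Rows with only some NaNs, and all-NaN rows in the interior, are kept.
    """
    def all_nan(row):
        return all(math.isnan(v) for v in row)

    start, end = 0, len(rows)
    # Walk inward from the top while rows are entirely NaN.
    while start < end and all_nan(rows[start]):
        start += 1
    # Walk inward from the bottom while rows are entirely NaN.
    while end > start and all_nan(rows[end - 1]):
        end -= 1
    return rows[start:end]


nan = math.nan
rows = [
    [nan, nan, nan],      # leading, all NaN: dropped
    [500.0, nan, 700.0],  # partially NaN: kept
    [nan, nan, nan],      # interior, all NaN: kept
    [500.0, 600.0, nan],  # partially NaN: kept
    [nan, nan, nan],      # trailing, all NaN: dropped
]
trimmed = trim_all_nan_edges(rows)
print(len(trimmed))  # 3
```

Under this rule, the 1985 row in the example above survives because it is only partially NaN.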

Is your feature request aligned with the scope of the package?

  • Yes, absolutely!
  • No, but it's still worth discussing.
  • N/A (this request is not a codebase enhancement).

Describe the solution you'd like, or your current workaround.

Two additional parameters can fine-tune the behavior:

  1. axis
  2. how

axis controls which axis gets dropped.

tri = cl.Triangle(
    data={
        'origin': [1985, 1985, 1985, 1986, 1986, 1987],
        'development': [1985, 1986, 1987, 1986, 1987, 1987],
        'paid': [500, np.nan, 700, 500, 600, 500],
    },
    origin='origin',
    development='development',
    columns=['paid'],
    cumulative=True
)
print(tri)
print(tri.dropna(axis=1))

         12     24     36
1985  500.0    NaN  700.0
1986  500.0  600.0    NaN
1987  500.0    NaN    NaN
         12     24
1986  500.0  600.0
1987  500.0    NaN

how ({'any', 'all'}, default 'any') controls whether a row needs to have some or all of its values as NaN to be dropped.

print(tri.dropna(axis=1, how='any'))
 
         12     24
1986  500.0  600.0
1987  500.0    NaN

Setting how='all' preserves the current behavior of dropna().

print(tri.dropna(axis=1, how='all'))
 
         12     24     36
1985  500.0    NaN  700.0
1986  500.0  600.0    NaN
1987  500.0    NaN    NaN
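The outputs above imply that only observed cells count toward the NaN check; otherwise 1986 and 1987 would also be dropped under how='any' because of their unobserved future cells. Below is a rough pure-Python sketch of that reading. The helper drop_origins, its arguments, and the square annual triangle layout are all assumptions for illustration, not part of chainladder. It also does not reproduce the existing edge-only rule for all-NaN periods.

```python
import math


def drop_origins(origins, ages, values, how="any"):
    """Hypothetical helper: drop origin rows based on NaNs in observed cells.

    Assumes a square annual triangle in which row i has observed cells in
    columns 0 .. n-1-i. Cells past that point are unobserved future values
    and are ignored by the NaN check.
    """
    if how not in ("any", "all"):
        raise ValueError("how must be 'any' or 'all'")
    n = len(origins)
    test = any if how == "any" else all
    keep = []
    for i in range(n):
        observed = values[i][: n - i]
        if not test(math.isnan(v) for v in observed):
            keep.append(i)
    if not keep:
        return [], [], []
    # Keep only the development ages the oldest remaining origin has reached.
    width = n - keep[0]
    return (
        [origins[i] for i in keep],
        ages[:width],
        [values[i][:width] for i in keep],
    )


nan = math.nan
origins = [1985, 1986, 1987]
ages = [12, 24, 36]
values = [
    [500.0, nan, 700.0],
    [500.0, 600.0, nan],
    [500.0, nan, nan],
]

any_origins, any_ages, any_values = drop_origins(origins, ages, values, how="any")
print(any_origins, any_ages)  # [1986, 1987] [12, 24]

all_origins, all_ages, all_values = drop_origins(origins, ages, values, how="all")
print(all_origins, all_ages)  # [1985, 1986, 1987] [12, 24, 36]
```

In this sketch the 36 column disappears under how='any' as a side effect of removing 1985, since no remaining origin has reached that age; it is not a second axis being dropped.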

Do you have any additional supporting notes?

See the Pandas docs for further examples. I believe only one axis should be allowed to be dropped at a time, since this would make the resulting behavior more predictable and easier for the user to understand.

This will likely lead to breaking changes and result in a major version bump.

Would you be willing to contribute this ticket?

  • Yes, absolutely!
  • Yes, but I would like some help.
  • No.

Metadata

Assignees

No one assigned

    Labels

    Priority

    Low

    Effort

    Medium

    Scope

    Codebase

    Projects

    Status
    Backlog

    Milestone

    No milestone
