Search code examples
pythonpandasloopscompare

Comparing One Record in Pandas Dataframe To All Other Records in Dataframe


I have a situation where I want to compare every value in one column of a dataframe against every other value in the same column. In this case, for every product, I want to see for Hyundais the comparison to Kias in each warehouse

There are ~10,000 products (500,000 records total) that I want to compare - every products against every other products in the dataset.

Please note that the concat of warehouses in the sample code below is to handle situations where a cross join is needed (warehouse for one product doesn't carry the other)

I'm trying to see if there is an easier/quicker way to do this rather than do the double "for" as per the below. The output is correct but the code takes over a day to run.

How is the most efficient way to make this work?

import pandas as pd

data = {'productid' : ['hyundai', 'hyundai', 'hyundai', 'kia','kia', 'kia'],
    'warehouse' : ['New Jersey', 'New York', 'California', 'New Jersey', 'New York', 'California'],
    'pct_total' : [35, 45, 20,65,55,80]}
df = pd.DataFrame(data)


dfoutput2 = pd.DataFrame()
for productid1 in df.productid.unique():
    for productid2 in df.productid.unique():
        if productid1 != productid2:
            df1=df[df.productid==productid1]
            df2=df[df.productid==productid2]

            allwarehouses=pd.concat(  [df1.warehouse, df2.warehouse])
            allwarehouses=allwarehouses.drop_duplicates()
            merged1=pd.merge(allwarehouses,df1, how="left", on=["warehouse"])
            mergedfinal=pd.merge(merged1,df2, how="left", on=["warehouse"])
            mergedfinal['lowervalue'] = mergedfinal[['pct_total_x','pct_total_y']].min(axis=1)
            dfoutput2=pd.concat(  [dfoutput2, mergedfinal])
print("done")

Solution

  • Merge against itself and drop duplicates:

    _df = df.merge(df, on='warehouse')
    _df = _df[lambda d: d['productid_x'] != d['productid_y']]
    _df.sort_values(by=['productid_x', 'warehouse'])
    #    productid_x   warehouse  pct_total_x productid_y  pct_total_y
    # 9      hyundai  California           20         kia           80
    # 1      hyundai  New Jersey           35         kia           65
    # 5      hyundai    New York           45         kia           55
    # 10         kia  California           80     hyundai           20
    # 2          kia  New Jersey           65     hyundai           35
    # 6          kia    New York           55     hyundai           45
    

    However, if your goal is just to find the lowest pct_total by warehouse, this may be better solved with a group-by, df.loc[df.groupby('warehouse')['pct_total'].idxmin()].