Search code examples
pandasmatlabcsvdatetimetimezone

How to read Matlab datenum from CSV into Pandas as time-zone aware datetime?


I have a Matlab datenum column in CSV file, e.g. 734622.540972. This is for timezone of 'America/Vancouver'. This serialized date corresponds to Apr 29, 2011 12:58:59.981. I want to be able to read this in pandas in an efficient manner. I have tried the following code but get Pandas warnings. Can someone explain why the warning is there and how to modify this code so that it is compliant with future version of Pandas:

import datetime as dt
import pytz as tz
import pandas as pd

# this is a function to convert matlab_datenum to tz-aware Python datetime. It works correctly.


def convert_matlab_date(matlab_datenum):
    python_datetime = dt.datetime.fromordinal(int(
        matlab_datenum)) + dt.timedelta(days=matlab_datenum % 1) - dt.timedelta(days=366)

    some_tz = tz.timezone('America/Vancouver')
    new_dt = some_tz.localize(python_datetime)

    return new_dt


full_fname = "some_file_with_matlab_dates.csv"
col_names = ("col1", "col2", "col3", "col4",
             "col5")

some_df = pd.read_csv(full_fname,  header=None, names=col_names,
                      date_parser=convert_matlab_date)

This gives a pandas FutureWarning about using pd.to_datetime . Output below:

c:\pd_matlab_conversion.py:29: FutureWarning: 
        Use pd.to_datetime instead.

  some_df = pd.read_csv(full_fname,  header=None, names=col_names,

some_df
Out[89]: 
                                    col1    col2    col3    col4       col5
2011-04-29 06:29:59.971196-07:00  346.86  346.67  347.40  346.74   163300.0
2011-04-29 06:31:00.019201-07:00  346.71  346.71  347.01  346.81    48500.0
2011-04-29 06:31:59.980805-07:00  346.81  346.80  347.12  347.09    55600.0
2011-04-29 06:33:00.028800-07:00  347.09  347.07  347.45  347.32    42300.0
2011-04-29 06:33:59.990404-07:00  347.29  347.27  347.90  347.80   104400.0 

CSV Snippet below:

734622.270833,346.860000,346.670000,347.400000,346.740000,163300.000000
734622.271528,346.710000,346.710000,347.010000,346.810000,48500.000000
734622.272222,346.810000,346.800000,347.120000,347.090000,55600.000000
734622.272917,347.090000,347.070000,347.450000,347.320000,42300.000000
734622.273611,347.290000,347.270000,347.900000,347.800000,104400.000000
734622.274306,347.770000,347.750000,347.900000,347.900000,51900.000000
734622.275000,347.860000,347.830000,348.420000,348.220000,99100.000000
734622.275694,348.280000,348.170000,348.630000,348.530000,82600.000000
734622.276389,348.560000,348.550000,348.890000,348.880000,74700.000000
734622.277083,348.850000,348.820000,349.330000,349.290000,101500.000000
734622.277778,349.290000,349.290000,349.900000,349.880000,134600.000000
734622.278472,349.850000,349.580000,349.930000,349.740000,109500.000000
734622.279167,349.860000,349.610000,350.000000,349.730000,119100.000000

Solution

  • matlab's datenum is

    A serial date number represents the whole and fractional number of days from a fixed, preset date (January 0, 0000) in the proleptic ISO calendar.

    pandas.to_datetime can convert fractional days to datetime, however year 0000 would be out of range. So you need to normalize that epoch to a reasonable one first. As shown here, you can do that like

    import pandas as pd
    
    df = pd.DataFrame({"datenum": [734622.270833,734622.271528,734622.272222,734622.272917,
                                   734622.273611,734622.274306,734622.275000,734622.275694,
                                   734622.276389,734622.277083,734622.277778,734622.278472,
                                   734622.279167]})
    
    epoch = 719529 # full days between 0000-01-00 and 1970-01-01
    
    # normalize to 1970-01-01 as epoch, convert to datetime and set time zone
    df["datetime"] = pd.to_datetime(df["datenum"]-epoch, unit="d").dt.tz_localize("America/Vancouver")
    
    df["datetime"]
    
    0    2011-04-29 06:29:59.971196416-07:00
    1    2011-04-29 06:31:00.019201024-07:00
    2    2011-04-29 06:31:59.980805120-07:00
    3    2011-04-29 06:33:00.028799488-07:00
    4    2011-04-29 06:33:59.990403584-07:00
    5    2011-04-29 06:35:00.038397952-07:00
    6    2011-04-29 06:36:00.000002048-07:00
    7    2011-04-29 06:36:59.961595904-07:00
    8    2011-04-29 06:38:00.009600512-07:00
    9    2011-04-29 06:38:59.971204352-07:00
    10   2011-04-29 06:40:00.019198976-07:00
    11   2011-04-29 06:40:59.980803072-07:00
    12   2011-04-29 06:42:00.028797440-07:00
    Name: datetime, dtype: datetime64[ns, America/Vancouver]