[python-data analysis] pandas time series processing

1. timestamp

1.1 create timestamp

  1. custom timestamp
  • Syntax: pd.Timestamp(ts_input, tz, unit, year, month, day, hour, minute, second, microsecond, nanosecond, tzinfo)
  • Code example:
import pandas as pd
import pytz

# When ts_input is a string, it is generally used with the tz parameter
timestamp = pd.Timestamp(ts_input="2023-01-05", tz=pytz.timezone("Asia/Shanghai"))
print(timestamp)  # 2023-01-05 00:00:00+08:00

# When ts_input is a numeric value, it is generally used with the unit parameter
timestamp = pd.Timestamp(ts_input=1672909342.246457, unit="s")
print(timestamp)  # 2023-01-05 09:02:22.246457100

# When ts_input is not passed, specify components such as year, month, day, hour, minute, and second directly
timestamp = pd.Timestamp(year=2023, month=1, day=5, hour=17, minute=8, second=34)
print(timestamp)  # 2023-01-05 17:08:34
  2. Get the current timestamp
print(pd.Timestamp.now())  # 2023-01-05 17:48:56.629418
print(pd.Timestamp.utcnow())  # 2023-01-05 09:48:56.629418+00:00
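
The current time can also be requested directly in a chosen time zone by passing tz to pd.Timestamp.now. The following is a minimal sketch; the variable names are illustrative only, and the printed times depend on when the code runs.

```python
import pandas as pd

# Current time as a time-zone-aware timestamp in Beijing time
now_bj = pd.Timestamp.now(tz="Asia/Shanghai")
# Current time as a time-zone-aware timestamp in UTC
now_utc = pd.Timestamp.now(tz="UTC")

print(now_bj.tz)
print(now_utc.tz)

# The two calls happen moments apart, so the instants differ only slightly,
# even though the displayed wall-clock times are 8 hours apart
print(abs(now_bj - now_utc) < pd.Timedelta(seconds=5))
```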

1.2 Common methods and properties of timestamp

1.2.1 common methods of timestamp

  • ts.tz_localize(tz)
    Function: Attach a time zone to a naive (time-zone-unaware) timestamp; the wall-clock time is kept and interpreted as local time in tz
    Parameters: tz: time zone identifier
ts = pd.Timestamp("2022-01-06")
print(ts.tz)  # None
ts = ts.tz_localize("Asia/Shanghai")  # Localized to Beijing time
print(ts)  # 2022-01-06 00:00:00+08:00
print(ts.value)  # 1641398400000000000, nanosecond timestamp

1.2.2 common attributes of timestamp

  • ts.value (view nanosecond integer timestamp)
ts = pd.Timestamp("2022-01-06")
print(ts.value)  # 1641427200000000000, nanoseconds since 1970-01-01 (a naive time is treated as UTC)
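
To make the relation between tz_localize and value concrete, the sketch below compares a naive timestamp with the same wall-clock time localized to Beijing time. The variable names are illustrative; the point is that value always counts nanoseconds from the UTC epoch, so the two results differ by the 8-hour offset.

```python
import pandas as pd

naive = pd.Timestamp("2022-01-06")          # no time zone attached
aware = naive.tz_localize("Asia/Shanghai")  # same wall-clock time, now Beijing time

# .value counts nanoseconds since 1970-01-01 00:00:00 UTC
print(naive.value)  # 1641427200000000000
print(aware.value)  # 1641398400000000000

# Midnight in Beijing occurs 8 hours before midnight UTC
diff_hours = (naive.value - aware.value) / 3_600_000_000_000
print(diff_hours)  # 8.0
```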

1.3 Time zone and time zone conversion

1.3.1 Time zone

Time zone information in Python is available through the third-party library pytz.

(1) Check the time zone

The two attributes all_timezones and common_timezones can be used in the pytz package to see which time zones are available.

import pytz

# The counts below depend on the installed pytz version
print(len(pytz.all_timezones))  # 595
print(pytz.all_timezones[:5])  # ['Africa/Abidjan', 'Africa/Accra', 'Africa/Addis_Ababa', 'Africa/Algiers', 'Africa/Asmara']
print(len(pytz.common_timezones))  # 437
print(pytz.common_timezones[:5])  # ['Africa/Abidjan', 'Africa/Accra', 'Africa/Addis_Ababa', 'Africa/Algiers', 'Africa/Asmara']

(2) Get the time zone object

In the pytz package, the pytz.timezone(zone) function returns a time zone object. Here zone is the time zone identifier; for example, the identifier for Shanghai, China is "Asia/Shanghai".

import pytz
tz = pytz.timezone('Asia/Shanghai')
tz  # <DstTzInfo 'Asia/Shanghai' LMT+8:06:00 STD>

1.3.2 Time zone conversion

(1) Convert UTC to other time zones (two ways)
  1. timestamp.astimezone(tz=None) -> Timestamp
  • code example
import pandas as pd

utc_ts = pd.Timestamp("2022-01-05 11:45:14",tz="utc")
print(utc_ts)  # 2022-01-05 11:45:14+00:00
beijing_ts = utc_ts.astimezone(tz="Asia/Shanghai")
print(beijing_ts)  # 2022-01-05 19:45:14+08:00
  2. timestamp.tz_convert(tz=None) -> Timestamp
  • code example
import pandas as pd

utc_ts = pd.Timestamp("2022-01-05 11:45:14",tz="utc")
print(utc_ts)  # 2022-01-05 11:45:14+00:00
beijing_ts = utc_ts.tz_convert(tz="Asia/Shanghai")
print(beijing_ts)  # 2022-01-05 19:45:14+08:00
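
The two ways can be checked against each other. The sketch below, with illustrative variable names, shows that astimezone and tz_convert return the same result, and that the conversion changes only the displayed wall-clock time, not the underlying UTC nanosecond value.

```python
import pandas as pd

utc_ts = pd.Timestamp("2022-01-05 11:45:14", tz="UTC")

via_astimezone = utc_ts.astimezone("Asia/Shanghai")
via_tz_convert = utc_ts.tz_convert("Asia/Shanghai")

print(via_astimezone)                    # 2022-01-05 19:45:14+08:00
print(via_astimezone == via_tz_convert)  # True

# The underlying UTC nanosecond value is untouched by the conversion
print(utc_ts.value == via_tz_convert.value)  # True
```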
(2) Convert other time zones to UTC (the same steps work for conversion between any two time zones)
  1. pd.DataFrame.tz_localize(tz, axis=0, level=None, copy=True, ambiguous='raise', nonexistent='raise') -> Series | DataFrame
  • Parameter introduction:
    tz: string or pytz.timezone object
    axis: the axis to localize (0 or 'index', 1 or 'columns')
    level: if the axis is a MultiIndex, the level to localize; otherwise must be None
    copy: whether to also copy the underlying data
    ambiguous: how to handle ambiguous wall-clock times, which occur when clocks are set back at the end of DST; the default 'raise' raises an error
    nonexistent: how to handle wall-clock times that do not exist because clocks are set forward at the start of DST; the default 'raise' raises an error
  • code example
    Simulate a set of time series data whose times are assumed to be Beijing time. Our goal is to convert these times to UTC and then generate numeric timestamps.
import pandas as pd
import numpy as np

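grade = np.random.uniform(52, 100, 200).astype(np.int64)
exam_dates = pd.date_range("2023-01-01", periods=200, freq="H")  # Beijing time; pandas 2.2+ prefers freq="h"
data = pd.DataFrame(data={"grade": grade})
data["date"] = exam_dates
data = data.set_index("date")  # the DataFrame time zone methods below operate on a DatetimeIndex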


One point deserves special attention: with respect to time zones, pandas has two kinds of time series (which are essentially sequences of Timestamp objects). The first is the naive kind, which carries no time zone; this is the default. The second is the time-zone-aware kind. An aware time series (or timestamp) stores a nanosecond-precision UTC timestamp internally, and that value does not change when the series is converted between time zones. Use the tz attribute to view the time zone of a time series, and ts.value to view the nanosecond timestamp of a single Timestamp:

print(data.index.tz)  # None, no time zone by default

Therefore, before moving this time series to another time zone, we must first decide which time zone it is in. Since we assume the times are Beijing time, we first attach that time zone to the series, that is, localize it to the Beijing time zone with the tz_localize(tz="Asia/Shanghai") method.

data_bj = data.tz_localize("Asia/Shanghai")
print(data_bj.index.tz)  # Asia/Shanghai
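
The ambiguous and nonexistent parameters listed earlier do not come into play here, because Asia/Shanghai does not currently observe DST. As a hedged illustration in a zone that does, the sketch below uses America/New_York, which is an assumption chosen for the example and not part of the exam data above.

```python
import pandas as pd

# Two wall-clock times in New York that are problematic in 2023:
#   02:30 on 12 March does not exist (clocks jump from 02:00 to 03:00)
#   01:30 on 5 November occurs twice (clocks fall back from 02:00 to 01:00)
idx = pd.DatetimeIndex(["2023-03-12 02:30:00", "2023-11-05 01:30:00"])

# With the defaults (ambiguous='raise', nonexistent='raise') localization fails
try:
    idx.tz_localize("America/New_York")
    raised = False
except Exception:  # the exact exception class depends on the pandas version
    raised = True

# Shift the non-existent time forward and mark the ambiguous one as NaT
localized = idx.tz_localize(
    "America/New_York", ambiguous="NaT", nonexistent="shift_forward"
)

print(raised)
print(localized)
```

Whether shifting, dropping, or raising is the right choice depends on the data; the default of raising is the safest when the source of the times is uncertain.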


The time series now has timezone information so we can convert it to another timezone using the ts.tz_convert(tz="utc") method.

data_utc = data_bj.tz_convert(tz="utc")


In this way, Beijing time is successfully converted to UTC. The converted index, however, is still time-zone-aware, so its values are displayed with a '+00:00' suffix. To remove the suffix, convert the time-zone-aware index to the naive type.

data_utc_naive = data_utc.tz_convert(None)
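
There are two ways to drop a time zone, and they differ when the data is not already in UTC. The sketch below uses a single illustrative Timestamp: tz_convert(None) converts to UTC before dropping the zone, while tz_localize(None) keeps the local wall-clock time.

```python
import pandas as pd

bj = pd.Timestamp("2023-01-01 08:00:00", tz="Asia/Shanghai")

# tz_convert(None): convert to UTC first, then drop the time zone
print(bj.tz_convert(None))   # 2023-01-01 00:00:00
# tz_localize(None): drop the time zone but keep the local wall-clock time
print(bj.tz_localize(None))  # 2023-01-01 08:00:00

# On a timestamp that is already in UTC, both give the same naive result
utc = bj.tz_convert("UTC")
print(utc.tz_convert(None) == utc.tz_localize(None))  # True
```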


If we need to further convert the date into a numeric timestamp, it can be achieved in the following two ways:
① By the definition of a timestamp: subtract the epoch "1970-01-01" from each time, then divide the difference by the desired unit

data_utc_naive["dtime1"] = (data_utc_naive.index - pd.Timestamp("1970-01-01")) // pd.Timedelta('1ms') # utc time to millisecond timestamp


② The values of the index are a NumPy array, which has a view(dtype) method; we can use it to reinterpret the datetime values as integers

# The view function returns the timestamp in nanoseconds (for a datetime64[ns] index), so we convert it to the precision we need ourselves.
# Sub-second units are related as follows: 1 s = 1,000 ms = 1,000,000 us = 1,000,000,000 ns
data_utc_naive["dtime2"] = data_utc_naive.index.values.view(dtype=np.int64) // 1_000_000  # ns to ms
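
As a check on the whole pipeline, the sketch below runs the same steps on three hypothetical Beijing times and confirms that both methods give the same millisecond timestamps. The explicit astype("datetime64[ns]") is an added safeguard, on the assumption that the index may not always be stored at nanosecond resolution.

```python
import numpy as np
import pandas as pd

# Three hourly Beijing wall-clock times (hypothetical sample data)
idx = pd.DatetimeIndex(["2023-01-01 00:00", "2023-01-01 01:00", "2023-01-01 02:00"])
df = pd.DataFrame({"grade": [60, 75, 90]}, index=idx)

# Localize to Beijing time, convert to UTC, then drop the time zone
utc_naive = df.tz_localize("Asia/Shanghai").tz_convert("UTC").tz_convert(None)

# Method 1: subtract the epoch and divide by one millisecond
ms1 = (utc_naive.index - pd.Timestamp("1970-01-01")) // pd.Timedelta("1ms")

# Method 2: reinterpret nanosecond datetimes as integers, then scale to ms
ns = utc_naive.index.values.astype("datetime64[ns]").view(np.int64)
ms2 = ns // 1_000_000

print(list(ms1))  # [1672502400000, 1672506000000, 1672509600000]
print(bool((ms1 == ms2).all()))  # True
```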


Tags: Python Data Analysis pandas

Posted by marklarah on Sat, 07 Jan 2023 16:31:37 +0530