Part 5. Data Preprocessing
- Checking for missing data
- df.info()
- df.value_counts(dropna=False)
- df.isnull()
- df.notnull()
- df.isnull().sum(axis=0)
- df['deck'].value_counts(dropna=False)
- df['deck'].isnull()
- df['deck'].notnull()
- df['deck'].isnull().sum()
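As a minimal sketch of the checks listed above, the example below uses a small made-up DataFrame whose column names mirror the ones in these notes. The values are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up sample data; only the column names come from the notes
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "deck": [np.nan, "C", np.nan, "C", np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

# Number of missing values in each column
missing_per_column = df.isnull().sum(axis=0)

# dropna=False makes value_counts report NaN as its own entry
deck_counts = df["deck"].value_counts(dropna=False)

# Number of missing values in a single column
deck_missing = df["deck"].isnull().sum()

print(missing_per_column)
print(deck_counts)
print(deck_missing)
```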
- Removing missing data
- df.dropna()
- df.dropna(axis=0)
- df.dropna(axis=1)
- df.dropna(axis=1, thresh=750)
- df.dropna(how='any')
- df.dropna(how='any', axis=1)
- df.dropna(how='all')
- df['age'].dropna()
- df.dropna(subset='age')
- df[['age','deck']].dropna()
- df[['age','deck']].dropna(how='all')
- df.dropna(subset=['age','deck'])
- df.dropna(subset=['age','deck'], how='all')
- df.dropna(subset=['age','deck'], how='all', ignore_index=True)
- df.dropna(subset=['age','deck'], how='all', inplace=True)
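The difference between `how`, `subset`, and `thresh` is easier to see on a tiny frame. The sketch below uses invented data, and the threshold of 3 is chosen only to fit this four-row example.

```python
import numpy as np
import pandas as pd

# Made-up sample data; only the column names come from the notes
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "deck": ["C", "E", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.92, 8.05],
})

# Default how='any': drop a row if any column is NaN
any_missing = df.dropna()

# how='all' with subset: drop a row only if BOTH age and deck are NaN
both_missing = df.dropna(subset=["age", "deck"], how="all")

# axis=1 with thresh: keep only columns with at least 3 non-NaN values
dense_columns = df.dropna(axis=1, thresh=3)

print(len(any_missing), len(both_missing), list(dense_columns.columns))
```

None of these calls modifies `df`; each returns a new object unless `inplace=True` is passed.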
- Replacing missing data
- df.fillna(100, inplace=True)
- df['age'] = df['age'].fillna(100)
- m = df['age'].mean()
df['age'] = df['age'].fillna(m)
- m = df['embark_town'].value_counts(dropna=True).idxmax()
df['embark_town'] = df['embark_town'].fillna(m)
- df['embark_town'] = df['embark_town'].ffill()
- fillna(method=...) accepts 'backfill', 'bfill', 'ffill', None; deprecated since pandas 2.1 in favor of ffill() / bfill()
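The three replacement strategies above (mean, most frequent value, forward fill) can be compared side by side. This is a sketch on invented data. It assigns the result back to the column rather than using `inplace=True` on a selected column, which is the safer pattern in recent pandas versions.

```python
import numpy as np
import pandas as pd

# Made-up sample data; only the column names come from the notes
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, np.nan],
    "embark_town": ["Cherbourg", np.nan, "Southampton", "Southampton"],
})

# Numeric column: replace NaN with the column mean
mean_age = df["age"].mean()
df["age"] = df["age"].fillna(mean_age)

# Categorical column: replace NaN with the most frequent value
most_common = df["embark_town"].value_counts(dropna=True).idxmax()
filled_mode = df["embark_town"].fillna(most_common)

# Alternative: carry the previous row's value forward
filled_ffill = df["embark_town"].ffill()

print(df["age"].tolist())
print(filled_mode.tolist())
print(filled_ffill.tolist())
```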
- Checking for duplicate data
- df.duplicated()
- df['fare'].duplicated()
- Removing duplicate data
- df.drop_duplicates()
- df.drop_duplicates(subset=['age', 'fare'])
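A short sketch of how `duplicated` and `drop_duplicates` behave, again on invented data: by default the first occurrence is kept and later repeats are flagged.

```python
import pandas as pd

# Made-up sample data; only the column names come from the notes
df = pd.DataFrame({
    "age": [22, 22, 22, 35],
    "fare": [7.25, 7.25, 7.25, 53.10],
    "deck": ["C", "C", "E", "C"],
})

# True for every row that repeats an earlier row in all columns
dup_rows = df.duplicated()

# Drop rows that are exact repeats
unique_rows = df.drop_duplicates()

# Compare only age and fare when deciding what counts as a repeat
unique_age_fare = df.drop_duplicates(subset=["age", "fare"])

print(dup_rows.tolist())
print(len(unique_rows), unique_age_fare.index.tolist())
```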
- Data standardization: making data formats consistent
- Unit conversion: mpg -> km/L
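The mpg to km/L conversion can be sketched as below. It assumes US gallons (3.785411784 L) and statute miles (1.609344 km). The `mpg` values and the `kpl` column name are made up for the example.

```python
import pandas as pd

KM_PER_MILE = 1.609344
LITERS_PER_US_GALLON = 3.785411784  # assumption: US gallon, not imperial

# 1 mpg expressed in km/L
MPG_TO_KPL = KM_PER_MILE / LITERS_PER_US_GALLON

# Made-up fuel economy values
df = pd.DataFrame({"mpg": [18.0, 15.0, 36.0]})

# Convert and round to two decimals for a consistent format
df["kpl"] = (df["mpg"] * MPG_TO_KPL).round(2)

print(df)
```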
- Data type conversion
- df.dtypes
- df['fare'].dtypes
- df['fare'] = df['fare'].astype('str')
- int, float, str, category
- df['pclass'] = df['pclass'].replace({1:'First', 2:'Second', 3:'Third'})
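As an illustration of the conversions above, the sketch below turns a string column into floats and maps integer class codes to labels stored as a `category`. The sample values are invented.

```python
import pandas as pd

# Made-up sample data; fare is stored as text on purpose
df = pd.DataFrame({
    "pclass": [1, 3, 2, 3],
    "fare": ["7.25", "71.28", "7.92", "8.05"],
})

# Text -> float so the column can be used in arithmetic
df["fare"] = df["fare"].astype("float")

# Integer codes -> readable labels, then store them as a categorical
df["pclass"] = (
    df["pclass"]
    .replace({1: "First", 2: "Second", 3: "Third"})
    .astype("category")
)

print(df.dtypes)
```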
- Handling categorical data
- np.histogram(df['fare'])
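One common way to turn a continuous column into categories is to take bin edges from `np.histogram` and pass them to `pd.cut`. The notes only mention `np.histogram`, so the `pd.cut` step, the three bins, and the labels below are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

# Made-up fare values
fare = pd.Series([0.0, 10.0, 20.0, 45.0, 90.0])

# Three equal-width bins between the min and max value
counts, bin_edges = np.histogram(fare, bins=3)

# Assign each value to a labeled bin; include_lowest keeps the minimum value
fare_band = pd.cut(
    fare,
    bins=bin_edges,
    labels=["low", "mid", "high"],  # hypothetical labels
    include_lowest=True,
)

print(counts, bin_edges)
print(fare_band.tolist())
```

Note that `np.histogram` treats bins as closed on the left while `pd.cut` closes them on the right by default, so values that fall exactly on an inner edge can land in different bins.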
- Date/time data types
- Python types: date, time, datetime
-
import datetime as dt
a = dt.date(2000, 1, 2)
dt.date(year=2000, month=1, day=2)
b = dt.time(3, 10, 15)
dt.time(hour=3, minute=10, second=15)
c = dt.datetime(2000, 1, 2, 3, 10, 15)
dt.datetime(year=2000, month=1, day=2, hour=3, minute=10, second=15)
d = dt.timedelta(weeks=8, days=6, hours=3, minutes=58, seconds=12)
a + d, c + d  ## b + d raises TypeError
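The arithmetic above can be checked directly. The sketch below also shows one possible workaround for shifting a `time` value, which is to combine it with a date first. That workaround is an addition for illustration, not part of the notes.

```python
import datetime as dt

a = dt.date(2000, 1, 2)
b = dt.time(3, 10, 15)
c = dt.datetime(2000, 1, 2, 3, 10, 15)
d = dt.timedelta(weeks=8, days=6, hours=3, minutes=58, seconds=12)

# date + timedelta uses only the whole-day part of the timedelta
later_date = a + d

# datetime + timedelta uses the full duration
later_datetime = c + d

# time + timedelta is not supported
try:
    b + d
    time_supports_timedelta = True
except TypeError:
    time_supports_timedelta = False

# Workaround: attach the time to a date, add, then take the time back out
shifted_time = (dt.datetime.combine(a, b) + d).time()

print(later_date, later_datetime, time_supports_timedelta, shifted_time)
```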
- pandas type: Timestamp
-
pd.Timestamp(2000, 1, 2)
pd.Timestamp(year=2000, month=1, day=2)
pd.Timestamp(year=1991, month=4, day=12, minute=2)
## same as dt.datetime(year=1991, month=4, day=12, minute=2)
pd.Timestamp('2017-01-01T12')
pd.Timestamp(1513393355.5, unit='s')
pd.Timestamp(1513393355, unit='s', tz='US/Pacific')
pd.Timestamp(2017, 1, 1, 12)
pd.Timestamp(year=2017, month=1, day=1, hour=12)
pd.to_datetime(1490195805433502912, unit='ns')
pd.to_datetime(1490195805, unit='s')
pd.to_datetime(1, unit='s')
pd.to_datetime([1, 2, 3], unit='D', origin=pd.Timestamp('1960-01-01'))
pd.to_datetime('2018-10-26 12:00:00.0000000011', format='%Y-%m-%d %H:%M:%S.%f')
pd.to_datetime(['2018-10-26 12:00:00', '2018-10-26 13:00:15'])
d = pd.to_datetime(['2018-10-26 12:00 +0900', '2018-10-26 13:00 -0500'])
d[1]
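A few of the calls above, run together as a sketch. For the strings with mixed UTC offsets, this example passes `utc=True` so the result is a single timezone-aware index. That argument is an addition here, since the handling of mixed offsets without it differs between pandas versions.

```python
import pandas as pd

# Keyword form and ISO 8601 string form give the same Timestamp
ts = pd.Timestamp(year=2017, month=1, day=1, hour=12)
same_ts = pd.Timestamp("2017-01-01T12")

# Seconds since the Unix epoch
from_epoch = pd.to_datetime(1490195805, unit="s")

# Day counts measured from a custom origin
days_from_origin = pd.to_datetime(
    [1, 2, 3], unit="D", origin=pd.Timestamp("1960-01-01")
)

# Mixed UTC offsets, normalized to UTC (utc=True is an assumption of this sketch)
mixed = pd.to_datetime(
    ["2018-10-26 12:00 +0900", "2018-10-26 13:00 -0500"], utc=True
)

print(ts, from_epoch)
print(days_from_origin)
print(mixed)
```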