1. Checking for missing data
    1. df.info()
    2. df.value_counts(dropna=False)
    3. df.isnull()
    4. df.notnull()
    5. df.isnull().sum(axis=0)
    6. df['deck'].value_counts(dropna=False)
    7. df['deck'].isnull()
    8. df['deck'].notnull()
    9. df['deck'].isnull().sum()
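
The checks above can be tried on a small frame. This is a minimal sketch: the toy `df` below is hypothetical data standing in for the Titanic columns (`age`, `deck`, `fare`) used in this list, not the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Titanic columns used above.
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "deck": ["C", np.nan, np.nan, "C", np.nan],
    "fare": [7.25, 71.28, 7.93, 53.10, 8.05],
})

# Number of missing values per column (axis=0 sums down the rows).
missing_per_column = df.isnull().sum(axis=0)

# Frequency table for one column; dropna=False counts NaN as its own entry.
deck_counts = df["deck"].value_counts(dropna=False)

# Boolean masks: isnull() and notnull() are exact complements.
deck_missing = df["deck"].isnull()
deck_present = df["deck"].notnull()

print(missing_per_column)
print(deck_counts)
```

Summing the boolean mask works because `True` counts as 1, so `isnull().sum()` is the number of missing cells.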
  2. Removing missing data
    1. df.dropna()
    2. df.dropna(axis=0)
    3. df.dropna(axis=1)
    4. df.dropna(axis=1, thresh=750)
    5. df.dropna(how='any')
    6. df.dropna(how='any', axis=1)
    7. df.dropna(how='all')
    8. df['age'].dropna()
    9. df.dropna(subset='age')
    10. df[['age','deck']].dropna()
    11. df[['age','deck']].dropna(how='all')
    12. df.dropna(subset=['age','deck'])
    13. df.dropna(subset=['age','deck'], how='all')
    14. df.dropna(subset=['age','deck'], how='all', ignore_index=True)
    15. df.dropna(subset=['age','deck'], how='all', inplace=True)
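
To see how `axis`, `thresh`, `subset`, and `how` change the result, here is a small sketch on hypothetical toy data (the values are made up for illustration). It uses `reset_index(drop=True)` rather than `ignore_index=True`, since the latter needs pandas 2.0 or later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror the ones used in the list above.
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan, 30.0],
    "deck": ["C", "E", np.nan, np.nan, np.nan],
    "fare": [7.25, 71.28, 7.93, 8.05, 9.50],
})

# Drop every row that has at least one NaN (how='any' is the default).
rows_any = df.dropna()

# Drop every column that has at least one NaN.
cols_any = df.dropna(axis=1)

# Keep only columns with at least 3 non-missing values.
cols_thresh = df.dropna(axis=1, thresh=3)

# Look only at 'age' and 'deck' when deciding which rows to drop.
subset_any = df.dropna(subset=["age", "deck"])
subset_all = df.dropna(subset=["age", "deck"], how="all")

# Renumber the surviving rows from 0.
subset_all_reset = subset_all.reset_index(drop=True)

print(subset_all_reset)
```

None of these calls modify `df`; only `inplace=True` changes the original frame.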
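  3. Replacing missing data
    1. df.fillna(100, inplace=True)
    2. df['age'] = df['age'].fillna(100)  # assign back; inplace=True on a selected column may not update df in recent pandas
    3. m = df['age'].mean(axis=0)
      df['age'] = df['age'].fillna(m)
    4. m = df['embark_town'].value_counts(dropna=True).idxmax()
      df['embark_town'] = df['embark_town'].fillna(m)
    5. df['embark_town'] = df['embark_town'].ffill()
      1. Older form: fillna(method='ffill'), where method = 'backfill', 'bfill', 'ffill', None; the method argument is deprecated since pandas 2.1 in favor of ffill() / bfill()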
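
The three replacement strategies in this section (mean, most frequent value, forward fill) can be compared side by side. This is a sketch on hypothetical toy data, and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, np.nan],
    "embark_town": ["Cherbourg", np.nan, "Southampton", "Southampton"],
})

# Numeric column: fill with the mean (NaN is skipped when averaging).
mean_age = df["age"].mean()
df["age"] = df["age"].fillna(mean_age)

# Categorical column: fill with the most frequent value.
most_frequent = df["embark_town"].value_counts(dropna=True).idxmax()
filled_mode = df["embark_town"].fillna(most_frequent)

# Alternative: carry the previous valid value forward.
filled_ffill = df["embark_town"].ffill()

print(df["age"].tolist())
print(filled_mode.tolist())
print(filled_ffill.tolist())
```

The second row differs between the last two results: filling with the most frequent value gives `Southampton`, while the forward fill copies `Cherbourg` from the row above.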
  4. Checking for duplicate data
    1. df.duplicated()
    2. df['fare'].duplicated()
  5. Removing duplicate data
    1. df.drop_duplicates()
    2. df.drop_duplicates(subset=['age', 'fare'])
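
A short sketch of how `duplicated()` and `drop_duplicates()` behave, using a hypothetical four-row frame. By default the first occurrence is kept and later repeats are flagged.

```python
import pandas as pd

# Hypothetical toy data: row 1 repeats row 0 exactly;
# rows 2 and 3 share 'age' and 'fare' but differ in 'deck'.
df = pd.DataFrame({
    "age": [22, 22, 30, 30],
    "fare": [7.25, 7.25, 8.05, 8.05],
    "deck": ["C", "C", "E", "G"],
})

# True where an entire row repeats an earlier row.
dup_rows = df.duplicated()

# True where a single column's value repeats an earlier value.
dup_fare = df["fare"].duplicated()

# Remove fully duplicated rows.
unique_rows = df.drop_duplicates()

# Treat rows as duplicates when only 'age' and 'fare' match.
unique_age_fare = df.drop_duplicates(subset=["age", "fare"])

print(dup_rows.tolist())
print(unique_age_fare)
```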
  6. Data standardization: making data formats consistent
    1. Unit conversion: mpg -> km/L
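
The mpg to km/L conversion can be sketched as follows. This assumes US gallons (1 gal = 3.785411784 L) and 1 mile = 1.609344 km; the `mpg` values and the `kpl` column name are hypothetical.

```python
import pandas as pd

# Assumption: "mpg" means miles per US gallon.
KM_PER_MILE = 1.609344
LITERS_PER_US_GALLON = 3.785411784
MPG_TO_KPL = KM_PER_MILE / LITERS_PER_US_GALLON  # roughly 0.4251

# Hypothetical fuel-economy values for illustration.
df = pd.DataFrame({"mpg": [18.0, 15.0, 36.0]})

# Convert to kilometers per liter and round to two decimals.
df["kpl"] = (df["mpg"] * MPG_TO_KPL).round(2)

print(df)
```

Keeping the factor in a named constant makes the unit assumption visible and easy to change if the data uses imperial gallons instead.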
  7. Data type conversion
    1. df.dtypes
    2. df['fare'].dtypes
    3. df['fare'] = df['fare'].astype('str')
      1. Target types: int, float, str, category
    4. df['pclass'] = df['pclass'].replace({1:'First', 2:'Second', 3:'Third'})
    5. Handling categorical data
      1. np.histogram(df['fare'])
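
One way the pieces of this section might fit together is sketched below. The original only names `astype`, `replace`, and `np.histogram`; the use of `pd.cut`, the choice of three bins, and the labels `low`/`mid`/`high` are assumptions added for illustration, as is the toy data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration.
df = pd.DataFrame({
    "pclass": [1, 3, 2, 3],
    "fare": [71.28, 7.25, 13.00, 8.05],
})

# Numeric -> string conversion.
fare_as_str = df["fare"].astype("str")

# Map integer codes to labels, then store them as a categorical column.
df["pclass"] = (
    df["pclass"].replace({1: "First", 2: "Second", 3: "Third"}).astype("category")
)

# np.histogram returns the count per bin and the bin edges.
counts, bin_edges = np.histogram(df["fare"], bins=3)

# Assumption: reuse those edges with pd.cut to label each fare.
df["fare_level"] = pd.cut(
    df["fare"],
    bins=bin_edges,
    labels=["low", "mid", "high"],
    include_lowest=True,  # keep the minimum value inside the first bin
)

print(df.dtypes)
print(df)
```

With equal-width bins, a single large fare pushes most rows into the lowest bin, so quantile-based binning may suit skewed data better.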
  8. Date/time data types
    1. Python types: date, time, datetime
      1. import datetime as dt
        a = dt.date(2000, 1, 2)
        dt.date(year=2000, month=1, day=2)
        b = dt.time(3,10,15)
        dt.time(hour=3, minute=10, second=15)
        c = dt.datetime(2000, 1, 2, 3, 10, 15)
        dt.datetime(year=2000, month=1, day=2, hour=3, minute=10, second=15)
        d = dt.timedelta(weeks = 8, days = 6, hours = 3, minutes = 58, seconds = 12 )
        a + d, c + d  # b + d raises TypeError: a time cannot be added to a timedelta
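
The arithmetic rules for the standard-library types can be checked directly. This sketch reuses the values from the list above; combining the date and time with `dt.datetime.combine` is one possible workaround for the `time + timedelta` error, not something the original prescribes.

```python
import datetime as dt

a = dt.date(2000, 1, 2)
b = dt.time(3, 10, 15)
c = dt.datetime(2000, 1, 2, 3, 10, 15)
d = dt.timedelta(weeks=8, days=6, hours=3, minutes=58, seconds=12)

# date + timedelta uses only the whole-day part of the timedelta (62 days).
date_shifted = a + d

# datetime + timedelta uses the full duration.
datetime_shifted = c + d

# time + timedelta is not supported and raises TypeError.
time_add_failed = False
try:
    b + d
except TypeError:
    time_add_failed = True

# Workaround: attach the time to a date first, then add the timedelta.
combined = dt.datetime.combine(a, b)

print(date_shifted)       # 2000-03-04
print(datetime_shifted)   # 2000-03-04 07:08:27
print(time_add_failed)    # True
```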
    2. pandas type: Timestamp
      1. pd.Timestamp(2000, 1, 2)
        pd.Timestamp(year=2000, month=1, day=2)
        pd.Timestamp(year = 1991, month = 4, day = 12, minute = 2)
        # same as dt.datetime(year=1991, month=4, day=12, minute=2)
        pd.Timestamp('2017-01-01T12')
        pd.Timestamp(1513393355.5, unit='s')
        pd.Timestamp(1513393355, unit='s', tz='US/Pacific')
        pd.Timestamp(2017, 1, 1, 12)
        pd.Timestamp(year=2017, month=1, day=1, hour=12)
        pd.to_datetime(1490195805433502912, unit='ns')
        pd.to_datetime(1490195805, unit='s')
        pd.to_datetime(1, unit='s')
        pd.to_datetime([1, 2, 3], unit='D', origin=pd.Timestamp('1960-01-01'))
        pd.to_datetime('2018-10-26 12:00:00.0000000011', format='%Y-%m-%d %H:%M:%S.%f')
        pd.to_datetime(['2018-10-26 12:00:00', '2018-10-26 13:00:15'])
        # mixed UTC offsets; pass utc=True to convert both values to UTC
        d = pd.to_datetime(['2018-10-26 12:00 +0900', '2018-10-26 13:00 -0500'])
        d[1]
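
A few of the constructions above, with results that can be verified by hand. This is a sketch only; passing `utc=True` for the mixed-offset strings is an addition here, chosen because it yields a single timezone-aware index.

```python
import datetime as dt
import pandas as pd

# Keyword construction; unspecified fields default to zero.
ts = pd.Timestamp(year=1991, month=4, day=12, minute=2)
same_as_datetime = ts == dt.datetime(1991, 4, 12, 0, 2)

# ISO 8601 string with an hour component.
iso = pd.Timestamp("2017-01-01T12")

# Seconds since the Unix epoch (1970-01-01 00:00:00 UTC).
epoch = pd.to_datetime(1490195805, unit="s")

# Day offsets counted from a custom origin.
offset_days = pd.to_datetime([1, 2, 3], unit="D", origin=pd.Timestamp("1960-01-01"))

# Strings with different UTC offsets, normalized to UTC.
mixed = pd.to_datetime(
    ["2018-10-26 12:00 +0900", "2018-10-26 13:00 -0500"], utc=True
)

print(iso)          # 2017-01-01 12:00:00
print(epoch)        # 2017-03-22 15:16:45
print(offset_days)
print(mixed[1])     # 2018-10-26 18:00:00+00:00
```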
