'''
Notes on the original source code:

### Author: [PinkWink](http://pinkwink.kr) 

* We check the claim in a news article that residents of the three Gangnam districts rate the perceived safety of their own districts highly.
* Original article: http://news1.kr/articles/?1911504

* Matplotlib's default cmap has changed, so heatmaps and similar plots need an explicit cmap option to look the same as in the textbook. (Already reflected in the source code.)
* As of Folium 0.4.0, the geo_str option of the choropleth command was renamed to geo_data. (Already reflected in the source code.)
* As of Folium 0.4.0, circle markers must be given the fill=True option. (Already reflected in the source code.)

'''

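Are the Three Gangnam Districts Safe?

Material for the lecture presentation on Wednesday, September 11

It includes content that is not in the textbook.

It includes code and comments that are not in the original source code.

!!!! About the comments !!!!
Code comments are omitted where they would only repeat earlier ones.

Cleaning up the data

Import the required modules.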

import numpy as np
import pandas as pd
'''
cp949, utf-8, iso8859-1, euc-kr

The full list can be looked up at
https://docs.python.org/3/library/codecs.html#standard-encodings
'''

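Read the downloaded data (CSV) file. Commas (,) are used as thousands separators, and the Korean text is encoded as euc-kr.
If you fetch the data the way the textbook describes, you will find that its format has changed since the textbook was written.
Obtaining the data ourselves would only serve to show that it is real data, so here we simply download the copy distributed on GitHub.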
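To make the two options concrete, here is a minimal sketch that reads a made-up two-row CSV from memory. The sample rows and the names `text`, `raw`, and `sample` are assumptions for illustration only; they are not the real crime file.

```python
import io

import pandas as pd

# Hypothetical two-row sample (not the real data), stored as EUC-KR bytes.
text = '관서명,절도 발생\n중부서,"1,395"\n종로서,"1,070"\n'
raw = text.encode("euc-kr")

# encoding tells pandas how to turn the bytes back into text;
# thousands="," lets "1,395" be parsed as the integer 1395.
sample = pd.read_csv(io.BytesIO(raw), thousands=",", encoding="euc-kr")
print(sample)
print(sample.dtypes)
```

Without `thousands=","` the second column would stay a string column, so sums and means on it would not work as numbers.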

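# The official Python documentation has a section that lists the supported encodings in detail.
# We scrape every encoding listed there so that we can later try them one by one when reading the CSV file.
# Chapter 2 is not about web scraping, so the scraping code is not explained here.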
import requests
from bs4 import BeautifulSoup
# Download the page and parse it with the html5lib parser.
req = requests.get('https://docs.python.org/3/library/codecs.html#standard-encodings').content
bs1 = BeautifulSoup(req, 'html5lib')
# Narrow down step by step: section -> table -> table body -> rows.
bs2 = bs1.find('div', id='standard-encodings')
bs3 = bs2.find('table', class_='docutils align-center')
bs4 = bs3.find('tbody')
bs5 = bs4.find_all('tr')
# The first <p> in each row holds the codec name.
codec_list = []
for i in bs5:
    codec_list.append(i.find('p').text)
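If no network connection is available, a rough offline substitute is the standard library's own alias table. This is only a sketch under that assumption: `encodings.aliases.aliases` maps alternative spellings to canonical codec names, so a codec that has no alias would be missing, and the result is not guaranteed to match the documentation table scraped above. The name `codec_names` is introduced here for illustration.

```python
from encodings.aliases import aliases

# The dictionary values are canonical codec names; set() removes duplicates.
codec_names = sorted(set(aliases.values()))

print(len(codec_names))
print([name for name in codec_names if name in ("cp949", "euc_kr")])
```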

# Import pandas, the Python tool for working with DataFrames,
# and refer to it from here on by the abbreviation pd.
import pandas as pd

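# codec_list now holds the names of the standard encodings scraped from the Python documentation.
# A for statement takes the items of the collection to the right of `in` one at a time,
# assigns each to the variable on the left, and runs the indented block once per item.
for i in codec_list:
    # try ~ except is Python's exception-handling syntax.
    try:
        crime_anal_police = pd.read_csv('../data/02. crime_in_Seoul.csv', thousands=',', encoding=i)
        print(i)
        print(crime_anal_police.columns[:4])
    except Exception:
        # Skip any encoding that cannot read the file.
        pass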

cp037
Index(['^Ü¯Ý½Ò', ']Ö{ó¾ÿ]Ù', ']Ö{ó^ô^E', '^Ý§§¾ÿ]Ù'], dtype='object')
cp273
Index(['¢]‾Ý½Ò', '|\äó¾ÿ|Ù', '|\äó¢ô¢E', '¢Ý@@¾ÿ|Ù'], dtype='object')
cp437
Index(['░ⁿ╝¡╕φ', '╗∞└╬ ╣▀╗²', '╗∞└╬ ░╦░┼', '░¡╡╡ ╣▀╗²'], dtype='object')
cp500
Index(['¢Ü¯Ý½Ò', '|Ö{ó¾ÿ|Ù', '|Ö{ó¢ô¢E', '¢Ý§§¾ÿ|Ù'], dtype='object')
cp720
Index(['░ⁿ╝ص╕و', '╗ه└╬ ╣▀╗²', '╗ه└╬ ░╦░┼', '░ص╡╡ ╣▀╗²'], dtype='object')
cp737
Index(['░ⁿ╝φ╕Ί', '╗Ή└╬ ╣▀╗²', '╗Ή└╬ ░╦░┼', '░φ╡╡ ╣▀╗²'], dtype='object')
cp775
Index(['░³╝ŁĖĒ', '╗ņ└╬ ╣▀╗²', '╗ņ└╬ ░╦░┼', '░ŁĄĄ ╣▀╗²'], dtype='object')
cp850
Index(['░³╝¡©Ý', '╗ý└╬ ╣▀╗²', '╗ý└╬ ░╦░┼', '░¡ÁÁ ╣▀╗²'], dtype='object')
cp852
Index(['░Ř╝şŞÝ', '╗ý└╬ ╣▀╗ř', '╗ý└╬ ░╦░┼', '░şÁÁ ╣▀╗ř'], dtype='object')
cp855
Index(['░Ч╝ГИь', '╗В└╬ ╣▀╗§', '╗В└╬ ░╦░┼', '░Гхх ╣▀╗§'], dtype='object')
cp858
Index(['░³╝¡©Ý', '╗ý└╬ ╣▀╗²', '╗ý└╬ ░╦░┼', '░¡ÁÁ ╣▀╗²'], dtype='object')
cp860
Index(['░ⁿ╝¡╕φ', '╗∞└╬ ╣▀╗²', '╗∞└╬ ░╦░┼', '░¡╡╡ ╣▀╗²'], dtype='object')
cp861
Index(['░ⁿ╝¡╕φ', '╗∞└╬ ╣▀╗²', '╗∞└╬ ░╦░┼', '░¡╡╡ ╣▀╗²'], dtype='object')
cp862
Index(['░ⁿ╝¡╕φ', '╗∞└╬ ╣▀╗²', '╗∞└╬ ░╦░┼', '░¡╡╡ ╣▀╗²'], dtype='object')
cp863
Index(['░ⁿ╝¾╕φ', '╗∞└╬ ╣▀╗²', '╗∞└╬ ░╦░┼', '░¾╡╡ ╣▀╗²'], dtype='object')
cp864
Index(['٠ﻙﺱﺝ٨ﻎ', '؛ﻌ¢ﺧ ٩ﻉ؛ﻱ', '؛ﻌ¢ﺧ ٠ﺛ٠ﻊ', '٠ﺝ٥٥ ٩ﻉ؛ﻱ'], dtype='object')
cp865
Index(['░ⁿ╝¡╕φ', '╗∞└╬ ╣▀╗²', '╗∞└╬ ░╦░┼', '░¡╡╡ ╣▀╗²'], dtype='object')
cp866
Index(['░№╝н╕э', '╗ь└╬ ╣▀╗¤', '╗ь└╬ ░╦░┼', '░н╡╡ ╣▀╗¤'], dtype='object')
cp869
Index(['░ΰ╝ΙΝς', '╗σ└╬ ╣▀╗ώ', '╗σ└╬ ░╦░┼', '░ΙΚΚ ╣▀╗ώ'], dtype='object')
cp875
Index(['£υπϋ', 'τ{‘ώ¦τ', 'τ{‘£ω£E', '£πίίώ¦τ'], dtype='object')
cp949
Index(['관서명', '살인 발생', '살인 검거', '강도 발생'], dtype='object')
cp1006
Index(['ﺍﮰﺙﺕﻥ', 'ﭨﮞﭺﺳ ﺗﻑﭨﮮ', 'ﭨﮞﭺﺳ ﺍﺯﺍﺧ', 'ﺍﭖﭖ ﺗﻑﭨﮮ'], dtype='object')
cp1026
Index(['¢"¯$½Ò', '|#çó¾ÿ|Ù', '|#çó¢ô¢E', '¢$§§¾ÿ|Ù'], dtype='object')
cp1125
Index(['░№╝н╕э', '╗ь└╬ ╣▀╗¤', '╗ь└╬ ░╦░┼', '░н╡╡ ╣▀╗¤'], dtype='object')
cp1140
Index(['^Ü¯Ý½Ò', ']Ö{ó¾ÿ]Ù', ']Ö{ó^ô^E', '^Ý§§¾ÿ]Ù'], dtype='object')
cp1250
Index(['°üĽ¸í', '»ěŔÎ ąß»ý', '»ěŔÎ °Ë°Ĺ', '°µµ ąß»ý'], dtype='object')
cp1251
Index(['°ьјён', '»мАО №Я»э', '»мАО °Л°Е', '°µµ №Я»э'], dtype='object')
cp1252
Index(['°ü¼¸í', '»ìÀÎ ¹ß»ý', '»ìÀÎ °Ë°Å', '°µµ ¹ß»ý'], dtype='object')
cp1253
Index(['°όΌΈν', '»μΐΞ Ήί»ύ', '»μΐΞ °Λ°Ε', '°µµ Ήί»ύ'], dtype='object')
cp1254
Index(['°ü¼¸í', '»ìÀÎ ¹ß»ı', '»ìÀÎ °Ë°Å', '°µµ ¹ß»ı'], dtype='object')
cp1256
Index(['°ü¼¸ي', '»ىہخ ¹ك»‎', '»ىہخ °ث°إ', '°µµ ¹ك»‎'], dtype='object')
cp1257
Index(['°ü¼øķ', '»ģĄĪ ¹ß»ż', '»ģĄĪ °Ė°Å', '°µµ ¹ß»ż'], dtype='object')
cp1258
Index(['°ü¼¸í', '»́ÀÎ ¹ß»ư', '»́ÀÎ °Ë°Å', '°µµ ¹ß»ư'], dtype='object')
euc_jp
Index(['淫辞誤', '詞昔 降持', '詞昔 伊暗', '悪亀 降持'], dtype='object')
euc_jis_2004
Index(['淫辞誤', '詞昔 降持', '詞昔 伊暗', '悪亀 降持'], dtype='object')
euc_jisx0213
Index(['淫辞誤', '詞昔 降持', '詞昔 伊暗', '悪亀 降持'], dtype='object')
euc_kr
Index(['관서명', '살인 발생', '살인 검거', '강도 발생'], dtype='object')
gb2312
Index(['包辑疙', '混牢 惯积', '混牢 八芭', '碍档 惯积'], dtype='object')
gbk
Index(['包辑疙', '混牢 惯积', '混牢 八芭', '碍档 惯积'], dtype='object')
gb18030
Index(['包辑疙', '混牢 惯积', '混牢 八芭', '碍档 惯积'], dtype='object')
latin_1
Index(['°ü¼¸í', '»ìÀÎ ¹ß»ý', '»ìÀÎ °Ë°Å', '°µµ ¹ß»ý'], dtype='object')
iso8859_2
Index(['°üź¸í', 'ťěŔÎ šßťý', 'ťěŔÎ °Ë°Ĺ', '°ľľ šßťý'], dtype='object')
iso8859_4
Index(['°üŧ¸í', 'ģėĀÎ šßģũ', 'ģėĀÎ °Ë°Å', '°ĩĩ šßģũ'], dtype='object')
iso8859_5
Index(['АќМИэ', 'ЛьРЮ ЙпЛ§', 'ЛьРЮ АЫАХ', 'АЕЕ ЙпЛ§'], dtype='object')
iso8859_9
Index(['°ü¼¸í', '»ìÀÎ ¹ß»ı', '»ìÀÎ °Ë°Å', '°µµ ¹ß»ı'], dtype='object')
iso8859_10
Index(['°üžļí', 'ŧėĀÎ đßŧý', 'ŧėĀÎ °Ë°Å', '°ĩĩ đßŧý'], dtype='object')
iso8859_13
Index(['°ü¼øķ', '»ģĄĪ ¹ß»ż', '»ģĄĪ °Ė°Å', '°µµ ¹ß»ż'], dtype='object')
iso8859_14
Index(['Ḟüỳẁí', 'ṠìÀÎ ṗßṠý', 'ṠìÀÎ ḞËḞÅ', 'Ḟṁṁ ṗßṠý'], dtype='object')
iso8859_15
Index(['°üŒží', '»ìÀÎ ¹ß»ý', '»ìÀÎ °Ë°Å', '°µµ ¹ß»ý'], dtype='object')
iso8859_16
Index(['°üŒží', '»ìÀÎ čß»ę', '»ìÀÎ °Ë°Ć', '°”” čß»ę'], dtype='object')
koi8_r
Index(['╟Э╪╜╦М', '╩Люн ╧ъ╩Щ', '╩Люн ╟к╟е', '╟╜╣╣ ╧ъ╩Щ'], dtype='object')
koi8_u
Index(['╟Э╪ґ╦М', '╩Люн ╧ъ╩Щ', '╩Люн ╟к╟е', '╟ґ╣╣ ╧ъ╩Щ'], dtype='object')
kz1048
Index(['°ьәён', '»мАО №Я»э', '»мАО °Л°Е', '°µµ №Я»э'], dtype='object')
mac_cyrillic
Index(['∞ьЉ≠Єн', 'їмјќ єяїэ', 'їмјќ ∞Ћ∞≈', '∞≠µµ єяїэ'], dtype='object')
mac_greek
Index(['ΑϋΦ≠Ημ', 'ΜλάΈ ΙΏΜΐ', 'ΜλάΈ ΑΥΑ≈', 'Α≠ΒΒ ΙΏΜΐ'], dtype='object')
mac_iceland
Index(['∞¸º≠∏Ì', 'ªÏ¿Œ πþª˝', 'ªÏ¿Œ ∞À∞≈', '∞≠µµ πþª˝'], dtype='object')
mac_latin2
Index(['įŁľ≠łŪ', 'Ľžņő ĻŖĽż', 'Ľžņő įňįŇ', 'į≠ĶĶ ĻŖĽż'], dtype='object')
mac_roman
Index(['∞¸º≠∏Ì', 'ªÏ¿Œ πﬂª˝', 'ªÏ¿Œ ∞À∞≈', '∞≠µµ πﬂª˝'], dtype='object')
mac_turkish
Index(['∞¸º≠∏Ì', 'ªÏ¿Œ πşª˝', 'ªÏ¿Œ ∞À∞≈', '∞≠µµ πşª˝'], dtype='object')
ptcp154
Index(['°ьјӯён', '»мАО №Я»э', '»мАО °Л°Е', '°ӯөө №Я»э'], dtype='object')
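The output above shows that many encodings decode the file without raising an error and still produce unreadable headers, so "no exception" is not enough to pick an encoding. One possible refinement, sketched below, keeps only the encodings whose decoded text actually contains Hangul. The helper `contains_hangul`, the short candidate list, and the sample bytes are assumptions for illustration; the sample reuses the first header cells printed above rather than the real file.

```python
def contains_hangul(text):
    """Return True if any character is a precomposed Hangul syllable."""
    return any("\uac00" <= ch <= "\ud7a3" for ch in text)


header = "관서명,살인 발생"   # header cells taken from the cp949 output above
raw = header.encode("cp949")

plausible = []
for name in ["utf_8", "latin_1", "cp1252", "euc_jp", "cp949", "euc_kr"]:
    try:
        decoded = raw.decode(name)
    except UnicodeDecodeError:
        continue              # the codec rejected the bytes outright
    if contains_hangul(decoded):
        plausible.append(name)

print(plausible)
```

With this filter only cp949 and euc_kr survive, which agrees with the two readable entries in the output above.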

# Continue with cp949, which the loop above showed decodes the file correctly. (The textbook uses euc-kr.)
crime_anal_police = pd.read_csv('../data/02. crime_in_Seoul.csv', thousands=',', encoding='cp949')
crime_anal_police.head()

	관서명	살인 발생	살인 검거	강도 발생	강도 검거	강간 발생	강간 검거	절도 발생	절도 검거	폭력 발생	폭력 검거
0	중부서	2	2	3	2	105	65	1395	477	1355	1170
1	종로서	3	3	6	5	115	98	1070	413	1278	1070
2	남대문서	1	0	6	4	65	46	1153	382	869	794
3	서대문서	2	2	5	4	154	124	1812	738	2056	1711
4	혜화서	3	2	5	4	96	63	1114	424	1015	861

# We have been using read_csv after `pd.` as if it were obvious.
# Here is one way to find out what is available after the dot.
dir(pd)

['Categorical',
 'CategoricalDtype',
 'CategoricalIndex',
 'DataFrame',
 'DateOffset',
 'DatetimeIndex',
 'DatetimeTZDtype',
 'ExcelFile',
 'ExcelWriter',
 'Float64Index',
 'Grouper',
 'HDFStore',
 'Index',
 'IndexSlice',
 'Int16Dtype',
 'Int32Dtype',
 'Int64Dtype',
 'Int64Index',
 'Int8Dtype',
 'Interval',
 'IntervalDtype',
 'IntervalIndex',
 'MultiIndex',
 'NaT',
 'Panel',
 'Period',
 'PeriodDtype',
 'PeriodIndex',
 'RangeIndex',
 'Series',
 'SparseArray',
 'SparseDataFrame',
 'SparseDtype',
 'SparseSeries',
 'TimeGrouper',
 'Timedelta',
 'TimedeltaIndex',
 'Timestamp',
 'UInt16Dtype',
 'UInt32Dtype',
 'UInt64Dtype',
 'UInt64Index',
 'UInt8Dtype',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__docformat__',
 '__file__',
 '__git_version__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_hashtable',
 '_lib',
 '_libs',
 '_np_version_under1p13',
 '_np_version_under1p14',
 '_np_version_under1p15',
 '_np_version_under1p16',
 '_np_version_under1p17',
 '_tslib',
 '_version',
 'api',
 'array',
 'arrays',
 'bdate_range',
 'compat',
 'concat',
 'core',
 'crosstab',
 'cut',
 'date_range',
 'datetime',
 'describe_option',
 'errors',
 'eval',
 'factorize',
 'get_dummies',
 'get_option',
 'infer_freq',
 'interval_range',
 'io',
 'isna',
 'isnull',
 'lreshape',
 'melt',
 'merge',
 'merge_asof',
 'merge_ordered',
 'notna',
 'notnull',
 'np',
 'offsets',
 'option_context',
 'options',
 'pandas',
 'period_range',
 'pivot',
 'pivot_table',
 'plotting',
 'qcut',
 'read_clipboard',
 'read_csv',
 'read_excel',
 'read_feather',
 'read_fwf',
 'read_gbq',
 'read_hdf',
 'read_html',
 'read_json',
 'read_msgpack',
 'read_parquet',
 'read_pickle',
 'read_sas',
 'read_sql',
 'read_sql_query',
 'read_sql_table',
 'read_stata',
 'read_table',
 'reset_option',
 'set_eng_float_format',
 'set_option',
 'show_versions',
 'test',
 'testing',
 'timedelta_range',
 'to_datetime',
 'to_msgpack',
 'to_numeric',
 'to_pickle',
 'to_timedelta',
 'tseries',
 'unique',
 'util',
 'value_counts',
 'wide_to_long']
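The full `dir(pd)` listing is long, so it can help to filter it. The sketch below keeps only the names that start with `read_`; the variable name `readers` is an assumption for illustration, and the exact list depends on the installed pandas version.

```python
import pandas as pd

# dir() returns a list of attribute names as strings, so ordinary
# string methods can be used to narrow it down.
readers = [name for name in dir(pd) if name.startswith("read_")]
print(readers)
```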

# Having found pd.read_csv, here is how to find out how to use it.
# help shows the documentation written by the tool's developers.
# It lists what can go inside the parentheses of pd.read_csv()
# and what each of those parameters means.
help(pd.read_csv)

Help on function read_csv in module pandas.io.parsers:

read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=None, error_bad_lines=True, warn_bad_lines=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None)
    Read a comma-separated values (csv) file into DataFrame.

    Also supports optionally iterating or breaking of the file
    into chunks.

    Additional help can be found in the online docs for
    `IO Tools <http://pandas.pydata.org/pandas-docs/stable/io.html>`_.

    Parameters
    ----------
    filepath_or_buffer : str, path object, or file-like object
        Any valid string path is acceptable. The string could be a URL. Valid
        URL schemes include http, ftp, s3, and file. For file URLs, a host is
        expected. A local file could be: file://localhost/path/to/table.csv.

        If you want to pass in a path object, pandas accepts either
        ``pathlib.Path`` or ``py._path.local.LocalPath``.

        By file-like object, we refer to objects with a ``read()`` method, such as
        a file handler (e.g. via builtin ``open`` function) or ``StringIO``.
    sep : str, default ','
        Delimiter to use. If sep is None, the C engine cannot automatically detect
        the separator, but the Python parsing engine can, meaning the latter will
        be used and automatically detect the separator by Python's builtin sniffer
        tool, ``csv.Sniffer``. In addition, separators longer than 1 character and
        different from ``'\s+'`` will be interpreted as regular expressions and
        will also force the use of the Python parsing engine. Note that regex
        delimiters are prone to ignoring quoted data. Regex example: ``'\r\t'``.
    delimiter : str, default ``None``
        Alias for sep.
    header : int, list of int, default 'infer'
        Row number(s) to use as the column names, and the start of the
        data.  Default behavior is to infer the column names: if no names
        are passed the behavior is identical to ``header=0`` and column
        names are inferred from the first line of the file, if column
        names are passed explicitly then the behavior is identical to
        ``header=None``. Explicitly pass ``header=0`` to be able to
        replace existing names. The header can be a list of integers that
        specify row locations for a multi-index on the columns
        e.g. [0,1,3]. Intervening rows that are not specified will be
        skipped (e.g. 2 in this example is skipped). Note that this
        parameter ignores commented lines and empty lines if
        ``skip_blank_lines=True``, so ``header=0`` denotes the first line of
        data rather than the first line of the file.
    names : array-like, optional
        List of column names to use. If file contains no header row, then you
        should explicitly pass ``header=None``. Duplicates in this list will cause
        a ``UserWarning`` to be issued.
    index_col : int, sequence or bool, optional
        Column to use as the row labels of the DataFrame. If a sequence is given, a
        MultiIndex is used. If you have a malformed file with delimiters at the end
        of each line, you might consider ``index_col=False`` to force pandas to
        not use the first column as the index (row names).
    usecols : list-like or callable, optional
        Return a subset of the columns. If list-like, all elements must either
        be positional (i.e. integer indices into the document columns) or strings
        that correspond to column names provided either by the user in `names` or
        inferred from the document header row(s). For example, a valid list-like
        `usecols` parameter would be ``[0, 1, 2]`` or ``['foo', 'bar', 'baz']``.
        Element order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``.
        To instantiate a DataFrame from ``data`` with element order preserved use
        ``pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']]`` for columns
        in ``['foo', 'bar']`` order or
        ``pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']]``
        for ``['bar', 'foo']`` order.

        If callable, the callable function will be evaluated against the column
        names, returning names where the callable function evaluates to True. An
        example of a valid callable argument would be ``lambda x: x.upper() in
        ['AAA', 'BBB', 'DDD']``. Using this parameter results in much faster
        parsing time and lower memory usage.
    squeeze : bool, default False
        If the parsed data only contains one column then return a Series.
    prefix : str, optional
        Prefix to add to column numbers when no header, e.g. 'X' for X0, X1, ...
    mangle_dupe_cols : bool, default True
        Duplicate columns will be specified as 'X', 'X.1', ...'X.N', rather than
        'X'...'X'. Passing in False will cause data to be overwritten if there
        are duplicate names in the columns.
    dtype : Type name or dict of column -> type, optional
        Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32,
        'c': 'Int64'}
        Use `str` or `object` together with suitable `na_values` settings
        to preserve and not interpret dtype.
        If converters are specified, they will be applied INSTEAD
        of dtype conversion.
    engine : {'c', 'python'}, optional
        Parser engine to use. The C engine is faster while the python engine is
        currently more feature-complete.
    converters : dict, optional
        Dict of functions for converting values in certain columns. Keys can either
        be integers or column labels.
    true_values : list, optional
        Values to consider as True.
    false_values : list, optional
        Values to consider as False.
    skipinitialspace : bool, default False
        Skip spaces after delimiter.
    skiprows : list-like, int or callable, optional
        Line numbers to skip (0-indexed) or number of lines to skip (int)
        at the start of the file.

        If callable, the callable function will be evaluated against the row
        indices, returning True if the row should be skipped and False otherwise.
        An example of a valid callable argument would be ``lambda x: x in [0, 2]``.
    skipfooter : int, default 0
        Number of lines at bottom of file to skip (Unsupported with engine='c').
    nrows : int, optional
        Number of rows of file to read. Useful for reading pieces of large files.
    na_values : scalar, str, list-like, or dict, optional
        Additional strings to recognize as NA/NaN. If dict passed, specific
        per-column NA values.  By default the following values are interpreted as
        NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan',
        '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan',
        'null'.
    keep_default_na : bool, default True
        Whether or not to include the default NaN values when parsing the data.
        Depending on whether `na_values` is passed in, the behavior is as follows:

        * If `keep_default_na` is True, and `na_values` are specified, `na_values`
          is appended to the default NaN values used for parsing.
        * If `keep_default_na` is True, and `na_values` are not specified, only
          the default NaN values are used for parsing.
        * If `keep_default_na` is False, and `na_values` are specified, only
          the NaN values specified `na_values` are used for parsing.
        * If `keep_default_na` is False, and `na_values` are not specified, no
          strings will be parsed as NaN.

        Note that if `na_filter` is passed in as False, the `keep_default_na` and
        `na_values` parameters will be ignored.
    na_filter : bool, default True
        Detect missing value markers (empty strings and the value of na_values). In
        data without any NAs, passing na_filter=False can improve the performance
        of reading a large file.
    verbose : bool, default False
        Indicate number of NA values placed in non-numeric columns.
    skip_blank_lines : bool, default True
        If True, skip over blank lines rather than interpreting as NaN values.
    parse_dates : bool or list of int or names or list of lists or dict, default False
        The behavior is as follows:

        * boolean. If True -> try parsing the index.
        * list of int or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3
          each as a separate date column.
        * list of lists. e.g.  If [[1, 3]] -> combine columns 1 and 3 and parse as
          a single date column.
        * dict, e.g. {'foo' : [1, 3]} -> parse columns 1, 3 as date and call
          result 'foo'

        If a column or index cannot be represented as an array of datetimes,
        say because of an unparseable value or a mixture of timezones, the column
        or index will be returned unaltered as an object data type. For
        non-standard datetime parsing, use ``pd.to_datetime`` after
        ``pd.read_csv``. To parse an index or column with a mixture of timezones,
        specify ``date_parser`` to be a partially-applied
        :func:`pandas.to_datetime` with ``utc=True``. See
        :ref:`io.csv.mixed_timezones` for more.

        Note: A fast-path exists for iso8601-formatted dates.
    infer_datetime_format : bool, default False
        If True and `parse_dates` is enabled, pandas will attempt to infer the
        format of the datetime strings in the columns, and if it can be inferred,
        switch to a faster method of parsing them. In some cases this can increase
        the parsing speed by 5-10x.
    keep_date_col : bool, default False
        If True and `parse_dates` specifies combining multiple columns then
        keep the original columns.
    date_parser : function, optional
        Function to use for converting a sequence of string columns to an array of
        datetime instances. The default uses ``dateutil.parser.parser`` to do the
        conversion. Pandas will try to call `date_parser` in three different ways,
        advancing to the next if an exception occurs: 1) Pass one or more arrays
        (as defined by `parse_dates`) as arguments; 2) concatenate (row-wise) the
        string values from the columns defined by `parse_dates` into a single array
        and pass that; and 3) call `date_parser` once for each row using one or
        more strings (corresponding to the columns defined by `parse_dates`) as
        arguments.
    dayfirst : bool, default False
        DD/MM format dates, international and European format.
    iterator : bool, default False
        Return TextFileReader object for iteration or getting chunks with
        ``get_chunk()``.
    chunksize : int, optional
        Return TextFileReader object for iteration.
        See the `IO Tools docs
        <http://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking>`_
        for more information on ``iterator`` and ``chunksize``.
    compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'
        For on-the-fly decompression of on-disk data. If 'infer' and
        `filepath_or_buffer` is path-like, then detect compression from the
        following extensions: '.gz', '.bz2', '.zip', or '.xz' (otherwise no
        decompression). If using 'zip', the ZIP file must contain only one data
        file to be read in. Set to None for no decompression.

        .. versionadded:: 0.18.1 support for 'zip' and 'xz' compression.

    thousands : str, optional
        Thousands separator.
    decimal : str, default '.'
        Character to recognize as decimal point (e.g. use ',' for European data).
    lineterminator : str (length 1), optional
        Character to break file into lines. Only valid with C parser.
    quotechar : str (length 1), optional
        The character used to denote the start and end of a quoted item. Quoted
        items can include the delimiter and it will be ignored.
    quoting : int or csv.QUOTE_* instance, default 0
        Control field quoting behavior per ``csv.QUOTE_*`` constants. Use one of
        QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
    doublequote : bool, default ``True``
       When quotechar is specified and quoting is not ``QUOTE_NONE``, indicate
       whether or not to interpret two consecutive quotechar elements INSIDE a
       field as a single ``quotechar`` element.
    escapechar : str (length 1), optional
        One-character string used to escape other characters.
    comment : str, optional
        Indicates remainder of line should not be parsed. If found at the beginning
        of a line, the line will be ignored altogether. This parameter must be a
        single character. Like empty lines (as long as ``skip_blank_lines=True``),
        fully commented lines are ignored by the parameter `header` but not by
        `skiprows`. For example, if ``comment='#'``, parsing
        ``#empty\na,b,c\n1,2,3`` with ``header=0`` will result in 'a,b,c' being
        treated as the header.
    encoding : str, optional
        Encoding to use for UTF when reading/writing (ex. 'utf-8'). `List of Python
        standard encodings
        <https://docs.python.org/3/library/codecs.html#standard-encodings>`_ .
    dialect : str or csv.Dialect, optional
        If provided, this parameter will override values (default or not) for the
        following parameters: `delimiter`, `doublequote`, `escapechar`,
        `skipinitialspace`, `quotechar`, and `quoting`. If it is necessary to
        override values, a ParserWarning will be issued. See csv.Dialect
        documentation for more details.
    tupleize_cols : bool, default False
        Leave a list of tuples on columns as is (default is to convert to
        a MultiIndex on the columns).

        .. deprecated:: 0.21.0
           This argument will be removed and will always convert to MultiIndex

    error_bad_lines : bool, default True
        Lines with too many fields (e.g. a csv line with too many commas) will by
        default cause an exception to be raised, and no DataFrame will be returned.
        If False, then these "bad lines" will dropped from the DataFrame that is
        returned.
    warn_bad_lines : bool, default True
        If error_bad_lines is False, and warn_bad_lines is True, a warning for each
        "bad line" will be output.
    delim_whitespace : bool, default False
        Specifies whether or not whitespace (e.g. ``' '`` or ``'    '``) will be
        used as the sep. Equivalent to setting ``sep='\s+'``. If this option
        is set to True, nothing should be passed in for the ``delimiter``
        parameter.

        .. versionadded:: 0.18.1 support for the Python parser.

    low_memory : bool, default True
        Internally process the file in chunks, resulting in lower memory use
        while parsing, but possibly mixed type inference.  To ensure no mixed
        types either set False, or specify the type with the `dtype` parameter.
        Note that the entire file is read into a single DataFrame regardless,
        use the `chunksize` or `iterator` parameter to return the data in chunks.
        (Only valid with C parser).
    memory_map : bool, default False
        If a filepath is provided for `filepath_or_buffer`, map the file object
        directly onto memory and access the data directly from there. Using this
        option can improve performance because there is no longer any I/O overhead.
    float_precision : str, optional
        Specifies which converter the C engine should use for floating-point
        values. The options are `None` for the ordinary converter,
        `high` for the high-precision converter, and `round_trip` for the
        round-trip converter.

    Returns
    -------
    DataFrame or TextParser
        A comma-separated values (csv) file is returned as two-dimensional
        data structure with labeled axes.

    See Also
    --------
    to_csv : Write DataFrame to a comma-separated values (csv) file.
    read_csv : Read a comma-separated values (csv) file into DataFrame.
    read_fwf : Read a table of fixed-width formatted lines into DataFrame.

    Examples
    --------
    >>> pd.read_csv('data.csv')  # doctest: +SKIP
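When only the default value of one or two parameters is of interest, reading the whole help text is more than is needed. As a sketch, the standard `inspect` module can pull the defaults out of the signature. The choice of `thousands` and `encoding` follows the options used earlier in this notebook; the name `params` is an assumption for illustration.

```python
import inspect

import pandas as pd

# signature() exposes each parameter together with its default value.
params = inspect.signature(pd.read_csv).parameters

for name in ("thousands", "encoding"):
    print(name, "default:", params[name].default)
```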

Getting to know pandas pivot_table

import pandas as pd
import numpy as np

# Load the dataset with pandas' read_excel function.
df = pd.read_excel("../data/02. sales-funnel.xlsx")
df.head()

	Account	Name	Rep	Manager	Product	Quantity	Price	Status
0	714466	Trantow-Barrows	Craig Booker	Debra Henley	CPU	1	30000	presented
1	714466	Trantow-Barrows	Craig Booker	Debra Henley	Software	1	10000	presented
2	714466	Trantow-Barrows	Craig Booker	Debra Henley	Maintenance	2	5000	pending
3	737550	Fritsch, Russel and Anderson	Craig Booker	Debra Henley	CPU	1	35000	declined
4	146832	Kiehn-Spinka	Daniel Hilton	Debra Henley	CPU	2	65000	won

# The textbook uses pivot_table to regroup the rows by an index column (here, Name).
pd.pivot_table(df,index=["Name"])

	Account	Price	Quantity
Name
Barton LLC	740150	35000	1.000000
Fritsch, Russel and Anderson	737550	35000	1.000000
Herman LLC	141962	65000	2.000000
Jerde-Hilpert	412290	5000	2.000000
Kassulke, Ondricka and Metz	307599	7000	3.000000
Keeling LLC	688981	100000	5.000000
Kiehn-Spinka	146832	65000	2.000000
Koepp Ltd	729833	35000	2.000000
Kulas Inc	218895	25000	1.500000
Purdy-Kunde	163416	30000	1.000000
Stokes LLC	239344	7500	1.000000
Trantow-Barrows	714466	15000	1.333333
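The table above holds one row per Name, with the numeric columns averaged within each group. A minimal sketch on a made-up three-row frame shows the same behaviour and compares it with an explicit `groupby`. The frame `sales` and its values are hypothetical, and `values` is passed explicitly so that only numeric columns are aggregated.

```python
import pandas as pd

# Hypothetical miniature of the sales-funnel data.
sales = pd.DataFrame({
    "Name": ["Kulas Inc", "Kulas Inc", "Barton LLC"],
    "Price": [40000, 10000, 35000],
    "Quantity": [2, 1, 1],
})

# Default aggregation is the mean of each group.
by_pivot = pd.pivot_table(sales, index=["Name"], values=["Price", "Quantity"])

# The same numbers via groupby, for comparison.
by_group = sales.groupby("Name")[["Price", "Quantity"]].mean()

print(by_pivot)
print(by_group)
```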

# Several columns can be given to build a multi-index.
# With a multi-index, however, the order of the columns matters.
pd.pivot_table(df,index=["Name","Rep","Manager"])

			Account	Price	Quantity
Name	Rep	Manager
Barton LLC	John Smith	Debra Henley	740150	35000	1.000000
Fritsch, Russel and Anderson	Craig Booker	Debra Henley	737550	35000	1.000000
Herman LLC	Cedric Moss	Fred Anderson	141962	65000	2.000000
Jerde-Hilpert	John Smith	Debra Henley	412290	5000	2.000000
Kassulke, Ondricka and Metz	Wendy Yule	Fred Anderson	307599	7000	3.000000
Keeling LLC	Wendy Yule	Fred Anderson	688981	100000	5.000000
Kiehn-Spinka	Daniel Hilton	Debra Henley	146832	65000	2.000000
Koepp Ltd	Wendy Yule	Fred Anderson	729833	35000	2.000000
Kulas Inc	Daniel Hilton	Debra Henley	218895	25000	1.500000
Purdy-Kunde	Cedric Moss	Fred Anderson	163416	30000	1.000000
Stokes LLC	Cedric Moss	Fred Anderson	239344	7500	1.000000
Trantow-Barrows	Craig Booker	Debra Henley	714466	15000	1.333333

# That is because the order determines how the rows are nested into groups.
# The textbook does not group down to Name, but doing so gives the result below.
pd.pivot_table(df,index=["Manager","Rep","Name"])

			Account	Price	Quantity
Manager	Rep	Name
Debra Henley	Craig Booker	Fritsch, Russel and Anderson	737550	35000	1.000000
	Craig Booker	Trantow-Barrows	714466	15000	1.333333
	Daniel Hilton	Kiehn-Spinka	146832	65000	2.000000
	Daniel Hilton	Kulas Inc	218895	25000	1.500000
	John Smith	Barton LLC	740150	35000	1.000000
	John Smith	Jerde-Hilpert	412290	5000	2.000000
Fred Anderson	Cedric Moss	Herman LLC	141962	65000	2.000000
		Purdy-Kunde	163416	30000	1.000000
		Stokes LLC	239344	7500	1.000000
	Wendy Yule	Kassulke, Ondricka and Metz	307599	7000	3.000000
		Keeling LLC	688981	100000	5.000000
		Koepp Ltd	729833	35000	2.000000
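To see what the index order changes and what it leaves alone, the sketch below pivots a made-up frame twice with the levels reversed. The frame `team` and its values are hypothetical. The aggregated numbers are the same in both tables; only the nesting differs.

```python
import pandas as pd

# Hypothetical data: two managers with two reps each.
team = pd.DataFrame({
    "Manager": ["Debra", "Debra", "Fred", "Fred"],
    "Rep": ["Craig", "Daniel", "Cedric", "Wendy"],
    "Price": [30000, 65000, 30000, 100000],
})

outer_manager = pd.pivot_table(team, index=["Manager", "Rep"], values=["Price"])
outer_rep = pd.pivot_table(team, index=["Rep", "Manager"], values=["Price"])

# Swapping the levels back and re-sorting recovers the first layout.
restored = outer_rep.swaplevel().sort_index()

print(outer_manager)
print(outer_rep)
```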

# The values to display can also be chosen explicitly.
# The default aggregation function is mean.
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"])

		Price
Manager	Rep
Debra Henley	Craig Booker	20000.000000
	Daniel Hilton	38333.333333
	John Smith	20000.000000
Fred Anderson	Cedric Moss	27500.000000
Fred Anderson	Wendy Yule	44250.000000

# The aggregation function can also be specified directly.
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],aggfunc=np.sum)

		Price
Manager	Rep
Debra Henley	Craig Booker	80000
	Daniel Hilton	115000
	John Smith	40000
Fred Anderson	Cedric Moss	110000
Fred Anderson	Wendy Yule	177000

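# The function passed to aggfunc is written a little differently from an ordinary function call.
# A function is defined with def and called by appending parentheses, but here
# we are handing over the function itself rather than running it, so no () is added.
# Python allows this because functions are first-class objects that can be passed as arguments.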
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],aggfunc=[np.mean,len])

		mean	len
		Price	Price
Manager	Rep
Debra Henley	Craig Booker	20000.000000	4
	Daniel Hilton	38333.333333	3
	John Smith	20000.000000	2
Fred Anderson	Cedric Moss	27500.000000	4
Fred Anderson	Wendy Yule	44250.000000	4
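Passing `len` without parentheses works the same way in plain Python as it does in `aggfunc`. The sketch below uses a hypothetical helper, `summarize`, that receives function objects and calls them itself; the prices are the four Craig Booker rows shown by `df.head()` earlier.

```python
def summarize(values, funcs):
    """Call each function object in funcs on values and collect the results."""
    return {func.__name__: func(values) for func in funcs}


prices = [30000, 10000, 5000, 35000]

# len, sum and max are written without (): the functions themselves are
# handed over, and summarize() decides when to call them.
report = summarize(prices, [len, sum, max])
print(report)
```

The count of 4 and the sum of 80000 agree with the Craig Booker rows in the pivot tables above.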

# So far we have pivoted on the index (rows).
# Here we see that columns can be used as a grouping key as well.
# The result in this case contains many empty cells (NaN),
# which is sometimes described in terms of a sparse matrix.
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],
               columns=["Product"],aggfunc=[np.sum])

		sum
		Price
	Product	CPU	Maintenance	Monitor	Software
Manager	Rep
Debra Henley	Craig Booker	65000.0	5000.0	NaN	10000.0
	Daniel Hilton	105000.0	NaN	NaN	10000.0
	John Smith	35000.0	5000.0	NaN	NaN
Fred Anderson	Cedric Moss	95000.0	5000.0	NaN	10000.0
Fred Anderson	Wendy Yule	165000.0	7000.0	5000.0	NaN

# This shows a simple way to handle missing values.
# pivot_table() can fill every missing cell at once
# through the fill_value keyword.
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],
               columns=["Product"],aggfunc=[np.sum],fill_value=0)

		sum
		Price
	Product	CPU	Maintenance	Monitor	Software
Manager	Rep
Debra Henley	Craig Booker	65000	5000	0	10000
	Daniel Hilton	105000	0	0	10000
	John Smith	35000	5000	0	0
Fred Anderson	Cedric Moss	95000	5000	0	10000
Fred Anderson	Wendy Yule	165000	7000	5000	0
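The effect of `columns` and `fill_value` can be checked on a frame small enough to verify by hand. In the sketch below the frame `orders` is hypothetical, and the aggregation is given as the string `"sum"`, which pandas accepts in place of a function object.

```python
import pandas as pd

# Hypothetical data: Daniel has no Software sale, so that cell has no value.
orders = pd.DataFrame({
    "Rep": ["Craig", "Craig", "Daniel"],
    "Product": ["CPU", "Software", "CPU"],
    "Price": [30000, 10000, 65000],
})

with_gaps = pd.pivot_table(orders, index=["Rep"], columns=["Product"],
                           values="Price", aggfunc="sum")
filled = pd.pivot_table(orders, index=["Rep"], columns=["Product"],
                        values="Price", aggfunc="sum", fill_value=0)

print(with_gaps)   # the missing combination shows up as NaN
print(filled)      # the same cell is 0 here
```

Whether 0 is the right replacement depends on the aggregation: it is natural for a sum or a count, but it could be misleading for a mean.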

# Compared with the previous call, the MultiIndex goes one level deeper (2 -> 3 levels),
# no columns argument is given,
# and the values shown grow from Price to Price plus Quantity.
pd.pivot_table(df,index=["Manager","Rep","Product"],
               values=["Price","Quantity"],aggfunc=[np.sum],fill_value=0)

			sum
			Price	Quantity
Manager	Rep	Product
Debra Henley	Craig Booker	CPU	65000	2
		Maintenance	5000	2
		Software	10000	1
	Daniel Hilton	CPU	105000	4
	Daniel Hilton	Software	10000	1
	John Smith	CPU	35000	1
	John Smith	Maintenance	5000	2
Fred Anderson	Cedric Moss	CPU	95000	3
		Maintenance	5000	1
		Software	10000	1
	Wendy Yule	CPU	165000	7
		Maintenance	7000	3
		Monitor	5000	2

# Compared with the previous call, one more aggregation function is applied
# and the margins=True option is added.
# If you want to know what the margins option does, help() is the place to look.
pd.pivot_table(df,index=["Manager","Rep","Product"],
               values=["Price","Quantity"],
               aggfunc=[np.sum,np.mean],fill_value=0,margins=True)

			sum		mean
			Price	Quantity	Price	Quantity
Manager	Rep	Product
Debra Henley	Craig Booker	CPU	65000	2	32500	1.000000
		Maintenance	5000	2	5000	2.000000
		Software	10000	1	10000	1.000000
	Daniel Hilton	CPU	105000	4	52500	2.000000
	Daniel Hilton	Software	10000	1	10000	1.000000
	John Smith	CPU	35000	1	35000	1.000000
	John Smith	Maintenance	5000	2	5000	2.000000
Fred Anderson	Cedric Moss	CPU	95000	3	47500	1.500000
		Maintenance	5000	1	5000	1.000000
		Software	10000	1	10000	1.000000
	Wendy Yule	CPU	165000	7	82500	3.500000
		Maintenance	7000	3	7000	3.000000
		Monitor	5000	2	5000	2.000000
All			522000	30	30705	1.764706
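
The extra "All" row can be reproduced on a small made-up table. This is only a sketch; `sales` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data.
sales = pd.DataFrame({
    "Rep": ["A", "A", "B"],
    "Price": [100, 300, 50],
})

# margins=True appends a row labelled "All" (the default margins_name)
# that aggregates over every row of the input.
table = pd.pivot_table(sales, index="Rep", values=["Price"],
                       aggfunc="sum", margins=True)
print(table)
```

The "All" row holds the grand total (100 + 300 + 50), computed with the same aggregation function as the other rows.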

# The signature shows that margins defaults to False.
# Its description says it adds rows / columns for subtotals and grand totals.
help(pd.pivot_table)

Help on function pivot_table in module pandas.core.reshape.pivot:

pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')
    Create a spreadsheet-style pivot table as a DataFrame. The levels in
    the pivot table will be stored in MultiIndex objects (hierarchical
    indexes) on the index and columns of the result DataFrame.

    Parameters
    ----------
    data : DataFrame
    values : column to aggregate, optional
    index : column, Grouper, array, or list of the previous
        If an array is passed, it must be the same length as the data. The
        list can contain any of the other types (except list).
        Keys to group by on the pivot table index.  If an array is passed,
        it is being used as the same manner as column values.
    columns : column, Grouper, array, or list of the previous
        If an array is passed, it must be the same length as the data. The
        list can contain any of the other types (except list).
        Keys to group by on the pivot table column.  If an array is passed,
        it is being used as the same manner as column values.
    aggfunc : function, list of functions, dict, default numpy.mean
        If list of functions passed, the resulting pivot table will have
        hierarchical columns whose top level are the function names
        (inferred from the function objects themselves)
        If dict is passed, the key is column to aggregate and value
        is function or list of functions
    fill_value : scalar, default None
        Value to replace missing values with
    margins : boolean, default False
        Add all row / columns (e.g. for subtotal / grand totals)
    dropna : boolean, default True
        Do not include columns whose entries are all NaN
    margins_name : string, default 'All'
        Name of the row / column that will contain the totals
        when margins is True.

    Returns
    -------
    table : DataFrame

    See Also
    --------
    DataFrame.pivot : Pivot without aggregation that can handle
        non-numeric data.

    Examples
    --------
    >>> df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
    ...                          "bar", "bar", "bar", "bar"],
    ...                    "B": ["one", "one", "one", "two", "two",
    ...                          "one", "one", "two", "two"],
    ...                    "C": ["small", "large", "large", "small",
    ...                          "small", "large", "small", "small",
    ...                          "large"],
    ...                    "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
    ...                    "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})
    >>> df
         A    B      C  D  E
    0  foo  one  small  1  2
    1  foo  one  large  2  4
    2  foo  one  large  2  5
    3  foo  two  small  3  5
    4  foo  two  small  3  6
    5  bar  one  large  4  6
    6  bar  one  small  5  8
    7  bar  two  small  6  9
    8  bar  two  large  7  9

    This first example aggregates values by taking the sum.

    >>> table = pivot_table(df, values='D', index=['A', 'B'],
    ...                     columns=['C'], aggfunc=np.sum)
    >>> table
    C        large  small
    A   B
    bar one      4      5
        two      7      6
    foo one      4      1
        two    NaN      6

    We can also fill missing values using the `fill_value` parameter.

    >>> table = pivot_table(df, values='D', index=['A', 'B'],
    ...                     columns=['C'], aggfunc=np.sum, fill_value=0)
    >>> table
    C        large  small
    A   B
    bar one      4      5
        two      7      6
    foo one      4      1
        two      0      6

    The next example aggregates by taking the mean across multiple columns.

    >>> table = pivot_table(df, values=['D', 'E'], index=['A', 'C'],
    ...                     aggfunc={'D': np.mean,
    ...                              'E': np.mean})
    >>> table
                      D         E
                   mean      mean
    A   C
    bar large  5.500000  7.500000
        small  5.500000  8.500000
    foo large  2.000000  4.500000
        small  2.333333  4.333333

    We can also calculate multiple types of aggregations for any given
    value column.

    >>> table = pivot_table(df, values=['D', 'E'], index=['A', 'C'],
    ...                     aggfunc={'D': np.mean,
    ...                              'E': [min, max, np.mean]})
    >>> table
                      D   E
                   mean max      mean min
    A   C
    bar large  5.500000  9   7.500000   6
        small  5.500000  9   8.500000   8
    foo large  2.000000  5   4.500000   4
        small  2.333333  6   4.333333   2

Organizing the crime data by district (구)

# The data set contains an Unnamed: 0 column because
# the file was originally saved without the index=False option,
# so the index was written into the file as an extra column by default.
import numpy as np
crime_anal_raw = pd.read_csv('../data/02. crime_in_Seoul_include_gu_name.csv', 
                             encoding='utf-8')
crime_anal_raw.head()

	Unnamed: 0	관서명	살인 발생	살인 검거	강도 발생	강도 검거	강간 발생	강간 검거	절도 발생	절도 검거	폭력 발생	폭력 검거	구별
0	0	중부서	2	2	3	2	105	65	1395	477	1355	1170	중구
1	1	종로서	3	3	6	5	115	98	1070	413	1278	1070	종로구
2	2	남대문서	1	0	6	4	65	46	1153	382	869	794	중구
3	3	서대문서	2	2	5	4	154	124	1812	738	2056	1711	서대문구
4	4	혜화서	3	2	5	4	96	63	1114	424	1015	861	종로구

# This example therefore passes index_col=0, which tells read_csv
# to use column 0 (Unnamed: 0) as the index
# instead of reading it as a data column.
crime_anal_raw = pd.read_csv('../data/02. crime_in_Seoul_include_gu_name.csv', 
                             encoding='utf-8', index_col=0)
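
The round trip that produces `Unnamed: 0` can be reproduced in memory. The sketch below uses a two-row frame with values taken from the first two rows shown above and an in-memory buffer instead of a file; the variable names are assumptions for illustration.

```python
import io
import pandas as pd

frame = pd.DataFrame({"관서명": ["중부서", "종로서"], "살인 발생": [2, 3]})

# By default to_csv writes the index as a first column without a header name.
csv_with_index = frame.to_csv()
reloaded = pd.read_csv(io.StringIO(csv_with_index))
print(reloaded.columns.tolist())   # contains 'Unnamed: 0'

# Option 1: tell read_csv that the first column is the index.
fixed_on_read = pd.read_csv(io.StringIO(csv_with_index), index_col=0)

# Option 2: do not write the index in the first place.
csv_without_index = frame.to_csv(index=False)
fixed_on_write = pd.read_csv(io.StringIO(csv_without_index))

print(fixed_on_read.columns.tolist())
print(fixed_on_write.columns.tolist())
```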

# Same syntax as in the pivot table examples above.
crime_anal = pd.pivot_table(crime_anal_raw, index='구별', aggfunc=np.sum)
crime_anal.head()

	강간 검거	강간 발생	강도 검거	강도 발생	살인 검거	살인 발생	절도 검거	절도 발생	폭력 검거	폭력 발생
구별
강남구	349	449	18	21	10	13	1650	3850	3705	4284
강동구	123	156	8	6	3	4	789	2366	2248	2712
강북구	126	153	13	14	8	7	618	1434	2348	2649
관악구	221	320	14	12	8	9	827	2706	2642	3298
광진구	220	240	26	14	4	4	1277	3026	2180	2625

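# Just as a value can be stored under a new key in a dictionary,
# a value can be assigned to a new key (column) of a DataFrame.
# The original DataFrame has no arrest-rate (검거율) columns, so
# they are derived here from the existing arrest (검거) and occurrence (발생) counts.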
crime_anal['강간검거율'] = crime_anal['강간 검거']/crime_anal['강간 발생']*100
crime_anal['강도검거율'] = crime_anal['강도 검거']/crime_anal['강도 발생']*100
crime_anal['살인검거율'] = crime_anal['살인 검거']/crime_anal['살인 발생']*100
crime_anal['절도검거율'] = crime_anal['절도 검거']/crime_anal['절도 발생']*100
crime_anal['폭력검거율'] = crime_anal['폭력 검거']/crime_anal['폭력 발생']*100

del crime_anal['강간 검거']
del crime_anal['강도 검거']
del crime_anal['살인 검거']
del crime_anal['절도 검거']
del crime_anal['폭력 검거']

crime_anal.head()

	강간 발생	강도 발생	살인 발생	절도 발생	폭력 발생	강간검거율	강도검거율	살인검거율	절도검거율	폭력검거율
구별
강남구	449	21	13	3850	4284	77.728285	85.714286	76.923077	42.857143	86.484594
강동구	156	6	4	2366	2712	78.846154	133.333333	75.000000	33.347422	82.890855
강북구	153	14	7	1434	2649	82.352941	92.857143	114.285714	43.096234	88.637222
관악구	320	12	9	2706	3298	69.062500	116.666667	88.888889	30.561715	80.109157
광진구	240	14	4	3026	2625	91.666667	185.714286	100.000000	42.200925	83.047619

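# How .loc is used here:
# .loc[boolean mask, column name] = value to assign
# In this loop, every row whose value in the column exceeds 100 is set to 100.
# The textbook attributes rates above 100 to carried-over cases: the arrest date can fall later than the occurrence date.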
con_list = ['강간검거율', '강도검거율', '살인검거율', '절도검거율', '폭력검거율']

for column in con_list:
    crime_anal.loc[crime_anal[column] > 100, column] = 100

crime_anal.head()

	강간 발생	강도 발생	살인 발생	절도 발생	폭력 발생	강간검거율	강도검거율	살인검거율	절도검거율	폭력검거율
구별
강남구	449	21	13	3850	4284	77.728285	85.714286	76.923077	42.857143	86.484594
강동구	156	6	4	2366	2712	78.846154	100.000000	75.000000	33.347422	82.890855
강북구	153	14	7	1434	2649	82.352941	92.857143	100.000000	43.096234	88.637222
관악구	320	12	9	2706	3298	69.062500	100.000000	88.888889	30.561715	80.109157
광진구	240	14	4	3026	2625	91.666667	100.000000	100.000000	42.200925	83.047619
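
As a cross-check of the capping step, the same rule can be written with `DataFrame.clip`. The sketch below uses two rows and two columns copied from the uncapped table above; using `clip` is an alternative shown for comparison, not the notebook's method.

```python
import pandas as pd

rates = pd.DataFrame(
    {"강도검거율": [85.714286, 133.333333], "살인검거율": [76.923077, 75.0]},
    index=["강남구", "강동구"],
)

# Same rule as the loop above: anything above 100 becomes 100.
capped_loc = rates.copy()
for column in capped_loc.columns:
    capped_loc.loc[capped_loc[column] > 100, column] = 100

# Alternative: clip caps every column at the given upper bound in one call.
capped_clip = rates.clip(upper=100)

print(capped_loc)
print(capped_clip)
```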

# Columns are renamed with a dictionary of key: value pairs.
# A dictionary looks up a value by its key, so
# each existing column name is the key and the new name is the value.
crime_anal.rename(columns = {'강간 발생':'강간', 
                             '강도 발생':'강도', 
                             '살인 발생':'살인', 
                             '절도 발생':'절도', 
                             '폭력 발생':'폭력'}, inplace=True)
crime_anal.head()

	강간	강도	살인	절도	폭력	강간검거율	강도검거율	살인검거율	절도검거율	폭력검거율
구별
강남구	449	21	13	3850	4284	77.728285	85.714286	76.923077	42.857143	86.484594
강동구	156	6	4	2366	2712	78.846154	100.000000	75.000000	33.347422	82.890855
강북구	153	14	7	1434	2649	82.352941	92.857143	100.000000	43.096234	88.637222
관악구	320	12	9	2706	3298	69.062500	100.000000	88.888889	30.561715	80.109157
광진구	240	14	4	3026	2625	91.666667	100.000000	100.000000	42.200925	83.047619

# sklearn (scikit-learn) is a Python library for data analysis and machine learning.
# Here its preprocessing module is imported.
from sklearn import preprocessing

# List the columns to preprocess so that only these columns are selected below.
col = ['강간', '강도', '살인', '절도', '폭력']

# .values returns only the underlying values as a NumPy array.
# Likewise, .columns returns only the column labels.
x = crime_anal[col].values

# Create a MinMaxScaler object from sklearn's preprocessing module
# and store it under the name min_max_scaler.
min_max_scaler = preprocessing.MinMaxScaler()

# x.astype converts the data type of the array;
# here the values are converted to float.
x_scaled = min_max_scaler.fit_transform(x.astype(float))

# x_scaled holds only the normalized values.
# It is a NumPy array rather than a DataFrame, so it carries no index or column names.
# pd.DataFrame turns it back into a DataFrame with those labels.
crime_anal_norm = pd.DataFrame(x_scaled, columns = col, index = crime_anal.index)


# Next, the derived arrest-rate columns computed earlier are added to this new DataFrame.
# First list all the arrest-rate column names.
col2 = ['강간검거율', '강도검거율', '살인검거율', '절도검거율', '폭력검거율']

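# This works like storing values under new keys in a dictionary:
# the arrest-rate columns are taken from the existing DataFrame crime_anal
# and assigned to the new DataFrame under the same column names (rows are aligned on the index).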
crime_anal_norm[col2] = crime_anal[col2]
crime_anal_norm.head()

	강간	강도	살인	절도	폭력	강간검거율	강도검거율	살인검거율	절도검거율	폭력검거율
구별
강남구	1.000000	0.941176	0.916667	0.953472	0.661386	77.728285	85.714286	76.923077	42.857143	86.484594
강동구	0.155620	0.058824	0.166667	0.445775	0.289667	78.846154	100.000000	75.000000	33.347422	82.890855
강북구	0.146974	0.529412	0.416667	0.126924	0.274769	82.352941	92.857143	100.000000	43.096234	88.637222
관악구	0.628242	0.411765	0.583333	0.562094	0.428234	69.062500	100.000000	88.888889	30.561715	80.109157
광진구	0.397695	0.529412	0.166667	0.671570	0.269094	91.666667	100.000000	100.000000	42.200925	83.047619
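
With its default range of 0 to 1, min-max scaling computes (x - min) / (max - min) separately for each column. The sketch below writes that formula out in NumPy for three rows and two columns copied from the counts above; because only three districts are used, the scaled values differ from the notebook's, which are computed over all 25 districts.

```python
import numpy as np

# Counts of 강간 and 강도 for 강남구, 강동구, 강북구 (from the table above).
x = np.array([[449.0, 21.0],
              [156.0, 6.0],
              [153.0, 14.0]])

col_min = x.min(axis=0)   # smallest value of each column
col_max = x.max(axis=0)   # largest value of each column

# Each column is mapped so that its minimum becomes 0 and its maximum 1.
x_scaled = (x - col_min) / (col_max - col_min)
print(x_scaled)
```

This also explains why each crime column in the normalized table contains exactly one 0 and one 1 across the districts.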

# Load the per-district CCTV data into the variable result_CCTV.
result_CCTV = pd.read_csv('../data/01. CCTV_result.csv', encoding='UTF-8', 
                          index_col='구별')

# Add the CCTV data to the new DataFrame as well,
# in the same way the arrest rates were added above.
crime_anal_norm[['인구수', 'CCTV']] = result_CCTV[['인구수', '소계']]
crime_anal_norm.head()

	강간	강도	살인	절도	폭력	강간검거율	강도검거율	살인검거율	절도검거율	폭력검거율	인구수	CCTV
구별
강남구	1.000000	0.941176	0.916667	0.953472	0.661386	77.728285	85.714286	76.923077	42.857143	86.484594	570500.0	2780
강동구	0.155620	0.058824	0.166667	0.445775	0.289667	78.846154	100.000000	75.000000	33.347422	82.890855	453233.0	773
강북구	0.146974	0.529412	0.416667	0.126924	0.274769	82.352941	92.857143	100.000000	43.096234	88.637222	330192.0	748
관악구	0.628242	0.411765	0.583333	0.562094	0.428234	69.062500	100.000000	88.888889	30.561715	80.109157	525515.0	1496
광진구	0.397695	0.529412	0.166667	0.671570	0.269094	91.666667	100.000000	100.000000	42.200925	83.047619	372164.0	707

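# Sum the normalized crime columns for each district to create a 범죄 (crime) column.
# The point to note is axis=1.
# axis=1 adds across the columns, giving one total per row; axis=0 adds down the rows, giving one total per column.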
col = ['강간','강도','살인','절도','폭력']
crime_anal_norm['범죄'] = np.sum(crime_anal_norm[col], axis=1)
crime_anal_norm.head()

	강간	강도	살인	절도	폭력	강간검거율	강도검거율	살인검거율	절도검거율	폭력검거율	인구수	CCTV	범죄
구별
강남구	1.000000	0.941176	0.916667	0.953472	0.661386	77.728285	85.714286	76.923077	42.857143	86.484594	570500.0	2780	4.472701
강동구	0.155620	0.058824	0.166667	0.445775	0.289667	78.846154	100.000000	75.000000	33.347422	82.890855	453233.0	773	1.116551
강북구	0.146974	0.529412	0.416667	0.126924	0.274769	82.352941	92.857143	100.000000	43.096234	88.637222	330192.0	748	1.494746
관악구	0.628242	0.411765	0.583333	0.562094	0.428234	69.062500	100.000000	88.888889	30.561715	80.109157	525515.0	1496	2.613667
광진구	0.397695	0.529412	0.166667	0.671570	0.269094	91.666667	100.000000	100.000000	42.200925	83.047619	372164.0	707	2.034438
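
The difference between the two axis values is easiest to see on a tiny frame. The district labels and numbers below are hypothetical.

```python
import pandas as pd

# Hypothetical normalized scores for two districts and two crime types.
scores = pd.DataFrame(
    {"강간": [1.0, 0.2], "강도": [0.9, 0.1]},
    index=["gu_A", "gu_B"],
)

row_totals = scores.sum(axis=1)      # one total per row (per district)
column_totals = scores.sum(axis=0)   # one total per column (per crime type)

print(row_totals)
print(column_totals)
```

`row_totals` is indexed by district, which is why the notebook uses `axis=1` to build a per-district column.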

# Likewise, sum the arrest rates to create a 검거 (arrest) column.
col = ['강간검거율','강도검거율','살인검거율','절도검거율','폭력검거율']
crime_anal_norm['검거'] = np.sum(crime_anal_norm[col], axis=1)
crime_anal_norm.head()

	강간	강도	살인	절도	폭력	강간검거율	강도검거율	살인검거율	절도검거율	폭력검거율	인구수	CCTV	범죄	검거
구별
강남구	1.000000	0.941176	0.916667	0.953472	0.661386	77.728285	85.714286	76.923077	42.857143	86.484594	570500.0	2780	4.472701	369.707384
강동구	0.155620	0.058824	0.166667	0.445775	0.289667	78.846154	100.000000	75.000000	33.347422	82.890855	453233.0	773	1.116551	370.084431
강북구	0.146974	0.529412	0.416667	0.126924	0.274769	82.352941	92.857143	100.000000	43.096234	88.637222	330192.0	748	1.494746	406.943540
관악구	0.628242	0.411765	0.583333	0.562094	0.428234	69.062500	100.000000	88.888889	30.561715	80.109157	525515.0	1496	2.613667	368.622261
광진구	0.397695	0.529412	0.166667	0.671570	0.269094	91.666667	100.000000	100.000000	42.200925	83.047619	372164.0	707	2.034438	416.915211

# With preprocessing finished, display the whole table to check the result.
crime_anal_norm

	강간	강도	살인	절도	폭력	강간검거율	강도검거율	살인검거율	절도검거율	폭력검거율	인구수	CCTV	범죄	검거
구별
강남구	1.000000	0.941176	0.916667	0.953472	0.661386	77.728285	85.714286	76.923077	42.857143	86.484594	570500.0	2780	4.472701	369.707384
강동구	0.155620	0.058824	0.166667	0.445775	0.289667	78.846154	100.000000	75.000000	33.347422	82.890855	453233.0	773	1.116551	370.084431
강북구	0.146974	0.529412	0.416667	0.126924	0.274769	82.352941	92.857143	100.000000	43.096234	88.637222	330192.0	748	1.494746	406.943540
관악구	0.628242	0.411765	0.583333	0.562094	0.428234	69.062500	100.000000	88.888889	30.561715	80.109157	525515.0	1496	2.613667	368.622261
광진구	0.397695	0.529412	0.166667	0.671570	0.269094	91.666667	100.000000	100.000000	42.200925	83.047619	372164.0	707	2.034438	416.915211
구로구	0.515850	0.588235	0.500000	0.435169	0.359423	58.362989	73.333333	75.000000	38.072805	80.877951	447874.0	1561	2.398678	325.647079
금천구	0.141210	0.058824	0.083333	0.172426	0.134074	80.794702	100.000000	100.000000	56.668794	86.465433	255082.0	1015	0.589867	423.928929
노원구	0.273775	0.117647	0.666667	0.386589	0.292268	61.421320	100.000000	100.000000	36.525308	85.530665	569384.0	1265	1.736946	383.477292
도봉구	0.000000	0.235294	0.083333	0.000000	0.000000	100.000000	100.000000	100.000000	44.967074	87.626093	348646.0	485	0.318627	432.593167
동대문구	0.204611	0.470588	0.250000	0.314061	0.250887	84.393064	100.000000	100.000000	41.090358	87.401884	369496.0	1294	1.490147	412.885306
동작구	0.527378	0.235294	0.250000	0.274376	0.100024	48.771930	55.555556	100.000000	35.442359	83.089005	412520.0	1091	1.387071	322.858850
마포구	0.553314	0.529412	0.500000	0.510434	0.353748	84.013605	71.428571	100.000000	31.819961	84.445189	389649.0	574	2.446908	371.707327
서대문구	0.149856	0.000000	0.000000	0.256244	0.134547	80.519481	80.000000	100.000000	40.728477	83.219844	327163.0	962	0.540647	384.467802
서초구	0.838617	0.235294	0.500000	0.537804	0.215654	63.358779	66.666667	75.000000	41.404175	87.453105	450310.0	1930	2.327368	333.882725
성동구	0.069164	0.235294	0.166667	0.186110	0.029558	94.444444	88.888889	100.000000	37.149969	86.538462	311244.0	1062	0.686793	407.021764
성북구	0.138329	0.000000	0.250000	0.247007	0.170726	82.666667	80.000000	100.000000	41.512605	83.974649	461260.0	1464	0.806061	388.153921
송파구	0.340058	0.470588	0.750000	0.744441	0.427524	80.909091	76.923077	90.909091	34.856437	84.552352	667483.0	618	2.732611	368.150048
양천구	0.806916	0.823529	0.666667	1.000000	1.000000	77.486911	84.210526	100.000000	48.469644	83.065080	479978.0	2034	4.297113	393.232162
영등포구	0.556196	1.000000	1.000000	0.650359	0.493024	62.033898	90.909091	85.714286	32.995951	82.894737	402985.0	904	3.699580	354.547963
용산구	0.265130	0.529412	0.250000	0.169004	0.133128	89.175258	100.000000	100.000000	37.700706	83.121951	244203.0	1624	1.346674	409.997915
은평구	0.184438	0.235294	0.083333	0.291139	0.275715	84.939759	66.666667	100.000000	37.147335	86.920467	494388.0	1873	1.069920	375.674229
종로구	0.314121	0.352941	0.333333	0.383510	0.190589	76.303318	81.818182	83.333333	38.324176	84.212822	162820.0	1002	1.574494	363.991830
중구	0.195965	0.235294	0.083333	0.508040	0.174273	65.294118	66.666667	66.666667	33.712716	88.309353	133240.0	671	1.196905	320.649519
중랑구	0.244957	0.352941	0.916667	0.366746	0.321589	79.144385	81.818182	92.307692	38.829040	84.545135	414503.0	660	2.202900	376.644434

seaborn

# Import pyplot from matplotlib, Python's basic plotting library,
# under the conventional alias plt.
import matplotlib.pyplot as plt

# This magic command makes Jupyter Notebook render figures
# directly below the cell; depending on the Jupyter version,
# plots may not appear inline without it.
%matplotlib inline

# Import seaborn, which draws more polished graphics on top of matplotlib,
# under the conventional alias sns.
import seaborn as sns

# np.linspace(start, stop, number of points)
# returns evenly spaced points; they serve as the x axis (horizontal axis) of the plots below.
x = np.linspace(0, 14, 100)

# numpy provides most of the mathematical functions met in undergraduate science and engineering.
# np.sin() is numpy's sine function;
# it is applied element-wise, mapping each x to sin(x).
y1 = np.sin(x)

# Relative to np.sin(x), the functions below differ in two ways:
# the constant multiplied in front sets the amplitude of the sine wave,
# and the constant added to x sets its phase shift.
y2 = 2*np.sin(x+0.5)
y3 = 3*np.sin(x+1.0)
y4 = 4*np.sin(x+1.5)

# Set the size of the figure (width, height in inches).
plt.figure(figsize=(10,6))

# plt.plot() draws a simple line plot.
# All four curves are drawn against the same x values so they share one x axis.
plt.plot(x,y1, x,y2, x,y3, x,y4)

# plt.show() displays the figure;
# with %matplotlib inline the plot appears even if this call is omitted.
plt.show()
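
The amplitude and phase-shift remarks can be checked numerically without drawing anything. This is a small sketch on the same x grid; the helper names `peak_y1`, `peak_y2`, and `zero_of_y2` are introduced here for illustration.

```python
import numpy as np

x = np.linspace(0, 14, 100)
y1 = np.sin(x)
y2 = 2 * np.sin(x + 0.5)

# Amplitude: the peak height scales with the multiplier (about 1 and about 2).
peak_y1 = np.abs(y1).max()
peak_y2 = np.abs(y2).max()
print(peak_y1, peak_y2)

# Phase shift: sin(x) crosses zero at x = pi, while sin(x + 0.5)
# crosses zero 0.5 earlier, at x = pi - 0.5.
zero_of_y2 = np.pi - 0.5
print(2 * np.sin(zero_of_y2 + 0.5))   # numerically close to 0
```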

# As noted above, seaborn focuses on good-looking graphics,
# so it also offers style themes; this example uses the style named white,
# which gives the plot a white background without grid lines.
sns.set_style("white")

plt.figure(figsize=(10,6))
plt.plot(x,y1, x,y2, x,y3, x,y4)

sns.despine()

plt.show()

# This example applies the dark style, which darkens the background.
sns.set_style("dark")

plt.figure(figsize=(10,6))
plt.plot(x,y1, x,y2, x,y3, x,y4)
plt.show()

# whitegrid adds grid lines on a white background, much like grid in R.
sns.set_style("whitegrid")

plt.figure(figsize=(10,6))
plt.plot(x,y1, x,y2, x,y3, x,y4)
plt.show()

plt.figure(figsize=(10,6))
plt.plot(x,y1, x,y2, x,y3, x,y4)

sns.despine(offset=10)
'''
When you meet an unfamiliar feature like this, check the details with help().
Below is the description of the offset option from that help text.
offset : int or dict, optional
        Absolute distance, in points, spines should be moved away
        from the axes (negative values move spines inward). A single value
        applies to all spines; a dict can be used to set offset values per
        side.
'''

plt.show()

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

sns.set_style("whitegrid")
%matplotlib inline

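# Libraries such as seaborn and sklearn give access to sample data sets for people new to data science;
# here the data set named tips is loaded.
# It records restaurant bills and tips and can be used, for example, to see whether tips differ between smokers and non-smokers.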
tips = sns.load_dataset("tips")
tips.head(5)

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

sns.set_style("whitegrid")

plt.figure(figsize=(8,6))

# Pass the total_bill column of tips to boxplot as the data to draw.
sns.boxplot(x=tips["total_bill"])
plt.show()

plt.figure(figsize=(8,6))
# Specify the x and y axes of the boxplot separately.
# x is a nominal variable with four categories.
# The data cover only Thursday through Sunday, not the whole week.
sns.boxplot(x="day", y="total_bill", data=tips)
plt.show()

plt.figure(figsize=(8,6))
# hue is a keyword option found throughout seaborn's plotting functions.
# It adds one more dimension to a single plot by coloring the marks by a nominal variable.
sns.boxplot(x="day", y="total_bill", hue="smoker", data=tips, palette="Set3")
plt.show()

plt.figure(figsize=(8,6))
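# When each observation is drawn as a point at the position of its value,
# points overlap where the data are dense, which hides how many observations there are.
# swarmplot addresses this by shifting points along the other (categorical) axis,
# so that overlapping points sit side by side and the density stays visible.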
sns.swarmplot(x="day", y="total_bill", data=tips, color=".5")
plt.show()

# This code draws the swarmplot on top of a boxplot.
# A boxplot shows only the quartiles and outliers,
# while a swarmplot also shows how the individual points are distributed, so the two are combined.
plt.figure(figsize=(8,6))
sns.boxplot(x="day", y="total_bill", data=tips)
sns.swarmplot(x="day", y="total_bill", data=tips, color=".25")
plt.show()
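
What a boxplot summarizes can be computed by hand. The sketch below uses made-up bill amounts and the common 1.5 x IQR rule for flagging outliers, which is also the default whisker length in seaborn's boxplot; the variable names are assumptions for illustration.

```python
import numpy as np

# Hypothetical bill amounts; the last value is deliberately extreme.
bills = np.array([10.0, 12.0, 14.0, 15.0, 16.0, 18.0, 20.0, 50.0])

# The box spans the first to the third quartile, with a line at the median.
q1, median, q3 = np.percentile(bills, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn individually as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = bills[(bills < lower_fence) | (bills > upper_fence)]

print(q1, median, q3)
print(outliers)
```

The individual values between the fences are exactly what the boxplot hides and the swarmplot shows.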

# The code below repeats earlier cells, so the comments are skipped.
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

tips = sns.load_dataset("tips")
tips.head(5)

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

sns.set_style("darkgrid")
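# sns.lmplot draws a scatter plot together with a fitted regression line.
# The textbook mentions the ci (confidence interval) setting, but the code does not pass it,
# because ci defaults to 95.
# Note: newer seaborn versions rename the size parameter to height (see the warning below).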
sns.lmplot(x="total_bill", y="tip", data=tips, size=7)
plt.show()

C:\Users\one\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\regression.py:546: UserWarning: The `size` paramter has been renamed to `height`; please update your code.
  warnings.warn(msg, UserWarning)

# The nominal variable smoker can be added as one more dimension through hue.
sns.lmplot(x="total_bill", y="tip", hue="smoker", data=tips, size=7)
plt.show()

# palette sets the color scheme used by lmplot.
# https://seaborn.pydata.org/generated/seaborn.color_palette.html#seaborn.color_palette
sns.lmplot(x="total_bill", y="tip", hue="smoker", data=tips, palette="Set1", size=7)
plt.show()

# From the official documentation:
# Create an array of the given shape and populate it with random samples from a uniform distribution over [0, 1).
# The function draws samples from a uniform distribution; (10, 12) asks for an array of 10 rows by 12 columns.
uniform_data = np.random.rand(10, 12)
uniform_data

array([[0.44289922, 0.74070318, 0.54430557, 0.95324863, 0.28491956,
        0.3324923 , 0.26183489, 0.78928935, 0.85443942, 0.57513012,
        0.92212344, 0.29698457],
       [0.93276987, 0.15426644, 0.46403145, 0.00764703, 0.03365418,
        0.95575039, 0.27753383, 0.63527261, 0.12261691, 0.99832323,
        0.84495932, 0.95602654],
       [0.77836787, 0.22402844, 0.45109024, 0.8811308 , 0.27954441,
        0.585003  , 0.7669939 , 0.08605604, 0.60796336, 0.78048386,
        0.35441648, 0.34201032],
       [0.35763842, 0.79673878, 0.61720219, 0.7025775 , 0.6400613 ,
        0.28625361, 0.31384789, 0.27929707, 0.76358424, 0.77930314,
        0.0121419 , 0.45374163],
       [0.91502203, 0.00219876, 0.41822037, 0.93349022, 0.67965994,
        0.64988312, 0.43815436, 0.29666868, 0.98494101, 0.99493227,
        0.48429262, 0.39925491],
       [0.4242073 , 0.76535252, 0.58353285, 0.65333952, 0.60155998,
        0.45358268, 0.23806388, 0.6165977 , 0.99858016, 0.94730931,
        0.94207421, 0.62766515],
       [0.89191989, 0.77267777, 0.8972766 , 0.43356719, 0.36439075,
        0.09795348, 0.5752986 , 0.58954244, 0.47617896, 0.05143237,
        0.49931986, 0.6602905 ],
       [0.02159631, 0.5555326 , 0.17136604, 0.44541612, 0.85281145,
        0.55620004, 0.16055769, 0.32305168, 0.01462843, 0.8428261 ,
        0.40800116, 0.05890537],
       [0.6946162 , 0.33539753, 0.75009131, 0.638539  , 0.95247645,
        0.14217928, 0.23211685, 0.94761891, 0.61435438, 0.85659537,
        0.29738112, 0.72464081],
       [0.61966785, 0.04172887, 0.74994317, 0.26119473, 0.25354159,
        0.390777  , 0.47118477, 0.53452897, 0.59617963, 0.22804767,
        0.52866891, 0.84821592]])
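
The shape and value range described above can be verified directly. The seeded `default_rng` variant is an addition for illustration (the notebook does not use it); it is one way to get the same numbers on every run.

```python
import numpy as np

uniform_data = np.random.rand(10, 12)   # 10 rows, 12 columns
print(uniform_data.shape)
print(uniform_data.min() >= 0.0, uniform_data.max() < 1.0)

# Assumption for illustration: a seeded Generator gives repeatable values.
rng = np.random.default_rng(0)
repeatable = rng.random((10, 12))
print(repeatable.shape)
```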

# seaborn provides heatmap plotting out of the box.
# The uniform samples form a 10 x 12 array, so the heatmap has the same dimensions.
sns.heatmap(uniform_data)
plt.show()

# From the seaborn documentation on vmin and vmax:
# Values to anchor the colormap, otherwise they are inferred from the data and other keyword arguments.
# In other words, they fix the range of values that the color scale spans.
sns.heatmap(uniform_data, vmin=0, vmax=1)
plt.show()

# Load the flights data set from seaborn into the variable flights.
flights = sns.load_dataset("flights")
flights.head(5)

	year	month	passengers
0	1949	January	112
1	1949	February	118
2	1949	March	132
3	1949	April	129
4	1949	May	121

# Build a pivot table by giving the index, columns, and values in turn.
# This reshaping is used in many fields; for example, the user-item matrix
# of a collaborative filtering recommender can be built the same way.
flights = flights.pivot("month", "year", "passengers")
flights.head(5)

year	1949	1950	1951	1952	1953	1954	1955	1956	1957	1958	1959	1960
month
January	112	115	145	171	196	204	242	284	315	340	360	417
February	118	126	150	180	196	188	233	277	301	318	342	391
March	132	141	178	193	236	235	267	317	356	362	406	419
April	129	135	163	181	235	227	269	313	348	348	396	461
May	121	125	172	183	229	234	270	318	355	363	420	472
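
The long-to-wide reshaping can be sketched on four rows copied from the flights output above. Recent pandas releases expect the arguments of `pivot` to be passed by keyword, so the sketch names them explicitly; `long_form` and `wide` are names introduced here.

```python
import pandas as pd

# Four rows in long form: one row per (year, month) observation.
long_form = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    "month": ["January", "February", "January", "February"],
    "passengers": [112, 118, 115, 126],
})

# Wide form: one row per month, one column per year.
# Note: month is a plain string here, so rows are sorted alphabetically,
# unlike the calendar order of the seaborn data set.
wide = long_form.pivot(index="month", columns="year", values="passengers")
print(wide)
```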

plt.figure(figsize=(10,8))
# A pivoted table is a common input to sns.heatmap().
# Here the pivot table above is visualized as a heatmap.
sns.heatmap(flights)
plt.show()

plt.figure(figsize=(10,8))

# From the sns.heatmap documentation:

# What is fmt?
# String formatting code to use when adding annotations.
# "d" formats the value as a decimal (base-10) integer.

# What is annot?
# If True, write the data value in each cell. 
# If an array-like with the same shape as data, 
# then use this to annotate the heatmap instead of the raw data.
# In short, it writes each cell's value inside the cell.

sns.heatmap(flights, annot=True, fmt="d")
plt.show()
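
The `fmt` codes follow Python's format specification mini-language, so their effect can be previewed with the built-in `format` function. That seaborn applies the code in this way is an assumption based on its documentation of `fmt` as a string formatting code; the sketch only demonstrates the Python side.

```python
# "d" prints an integer in base 10.
as_integer = format(112, "d")

# "f" prints fixed-point notation with six digits after the point by default.
as_fixed = format(0.5, "f")
as_two_digits = format(0.5, ".2f")

print(as_integer, as_fixed, as_two_digits)

# "d" only accepts integers; a float raises ValueError.
try:
    format(112.0, "d")
    float_with_d_failed = False
except ValueError:
    float_with_d_failed = True
print(float_with_d_failed)
```

This is why the integer passenger counts use `fmt="d"`, while tables of floating-point values need a code such as `"f"`.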

sns.set(style="ticks")
# Load the iris data set from seaborn.
iris = sns.load_dataset("iris")
iris.head(10)

	sepal_length	sepal_width	petal_length	petal_width	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa
5	5.4	3.9	1.7	0.4	setosa
6	4.6	3.4	1.4	0.3	setosa
7	5.0	3.4	1.5	0.2	setosa
8	4.4	2.9	1.4	0.2	setosa
9	4.9	3.1	1.5	0.1	setosa

# Draw a scatter plot matrix over all numeric columns.
# It shows the pairwise relationships between the columns.
sns.pairplot(iris)
plt.show()

# The hue option adds a nominal variable,
# putting one more dimension into the plot.
sns.pairplot(iris, hue="species")
plt.show()

# With large data sets, pairplot can take a very long time.
# Passing only the columns of interest through vars avoids this.
sns.pairplot(iris, vars=["sepal_width", "sepal_length"])
plt.show()

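# The x and y variables of the scatter plot matrix can also be specified separately.
# Before, each diagonal panel compared a column with itself and therefore showed its distribution;
# now the diagonal panels no longer pair a column with itself.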
sns.pairplot(iris, x_vars=["sepal_width", "sepal_length"], 
             y_vars=["petal_width", "petal_length"])
plt.show()

# Load the anscombe data set into the variable anscombe.
anscombe = sns.load_dataset("anscombe")
anscombe.head(5)

	dataset	x	y
0	I	10.0	8.04
1	I	8.0	6.95
2	I	13.0	7.58
3	I	9.0	8.81
4	I	11.0	8.33

sns.set_style("darkgrid")
# Setting ci (confidence interval) to None
# deliberately leaves the confidence band out of the plot.
sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'I'"),  ci=None, size=7)
plt.show()

# scatter_kws passes extra keyword arguments to the scatter plot; "s" sets the marker size.
sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'I'"),
           ci=None, scatter_kws={"s": 80}, size=7)
plt.show()

# This data set follows a curvilinear rather than a linear relationship,
# so the straight-line fit (order=1) describes it poorly.
sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'II'"),
           order=1, ci=None, scatter_kws={"s": 80}, size=7)
plt.show()

# The order option fits a second-degree polynomial instead.
# The default is a first-degree (straight-line) fit; here it is changed to 2.
sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'II'"),
           order=2, ci=None, scatter_kws={"s": 80}, size=7)
plt.show()
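
The effect of `order` can be reproduced with `numpy.polyfit`, which the seaborn documentation names as the estimator used when `order` is greater than 1. The data below are a made-up exact parabola, not the Anscombe values, so the numbers only illustrate the idea.

```python
import numpy as np

# Hypothetical curved data: an exact parabola with its peak at x = 10.
x = np.arange(4.0, 15.0)
y = -0.1 * (x - 10.0) ** 2 + 9.0

line = np.polyfit(x, y, deg=1)       # straight-line fit (like order=1)
parabola = np.polyfit(x, y, deg=2)   # second-degree fit (like order=2)

# Largest absolute residual of each fit.
line_error = np.abs(np.polyval(line, x) - y).max()
parabola_error = np.abs(np.polyval(parabola, x) - y).max()

print(line_error, parabola_error)
```

The straight line leaves a clearly visible residual, while the second-degree fit reproduces the curve up to rounding error.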

sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'III'"),
           ci=None, scatter_kws={"s": 80}, size=7)
plt.show()

# robust=True fits a robust regression that down-weights outliers (it relies on statsmodels).
sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'III'"),
           robust=True, ci=None, scatter_kws={"s": 80}, size=7)
plt.show()

Visualization using seaborn

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

import platform

# matplotlib's default fonts lack Korean glyphs, so Korean text
# renders as broken characters. This code uses matplotlib's
# font_manager and rc to set a font that supports Korean.
path = "c:/Windows/Fonts/malgun.ttf"
from matplotlib import font_manager, rc
if platform.system() == 'Darwin':
    rc('font', family='AppleGothic')
elif platform.system() == 'Windows':
    font_name = font_manager.FontProperties(fname=path).get_name()
    rc('font', family=font_name)
else:
    print('Unknown system... sorry~~~~')

crime_anal_norm.head()

	강간	강도	살인	절도	폭력	강간검거율	강도검거율	살인검거율	절도검거율	폭력검거율	인구수	CCTV	범죄	검거
구별
강남구	1.000000	0.941176	0.916667	0.953472	0.661386	77.728285	85.714286	76.923077	42.857143	86.484594	570500.0	2780	4.472701	369.707384
강동구	0.155620	0.058824	0.166667	0.445775	0.289667	78.846154	100.000000	75.000000	33.347422	82.890855	453233.0	773	1.116551	370.084431
강북구	0.146974	0.529412	0.416667	0.126924	0.274769	82.352941	92.857143	100.000000	43.096234	88.637222	330192.0	748	1.494746	406.943540
관악구	0.628242	0.411765	0.583333	0.562094	0.428234	69.062500	100.000000	88.888889	30.561715	80.109157	525515.0	1496	2.613667	368.622261
광진구	0.397695	0.529412	0.166667	0.671570	0.269094	91.666667	100.000000	100.000000	42.200925	83.047619	372164.0	707	2.034438	416.915211

# Draw a scatter plot matrix for 강도 (robbery), 살인 (murder), and 폭력 (violence).
# kind='reg' adds a regression line to each panel.
sns.pairplot(crime_anal_norm, vars=["강도", "살인", "폭력"], kind='reg', size=3)
plt.show()

C:\Users\one\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\axisgrid.py:2065: UserWarning: The `size` parameter has been renamed to `height`; pleaes update your code.
  warnings.warn(msg, UserWarning)
C:\Users\one\AppData\Local\Continuum\anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:211: RuntimeWarning: Glyph 8722 missing from current font.
  font.set_text(s, 0.0, flags=flags)
C:\Users\one\AppData\Local\Continuum\anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:180: RuntimeWarning: Glyph 8722 missing from current font.
  font.set_text(s, 0, flags=flags)

# Give the x and y variables separately
# to build the scatter plot matrix.
# The diagonal panels differ from the plot above, since no column is paired with itself.
sns.pairplot(crime_anal_norm, x_vars=["인구수", "CCTV"], 
             y_vars=["살인", "강도"], kind='reg', size=3)
plt.show()

sns.pairplot(crime_anal_norm, x_vars=["인구수", "CCTV"], 
             y_vars=["살인검거율", "폭력검거율"], kind='reg', size=3)
plt.show()

sns.pairplot(crime_anal_norm, x_vars=["인구수", "CCTV"], 
             y_vars=["절도검거율", "강도검거율"], kind='reg', size=3)
plt.show()

# The 검거 column was obtained by summing the five arrest rates.
# That sum is not normalized, so it is rescaled here so that its maximum becomes 100.
tmp_max = crime_anal_norm['검거'].max()
crime_anal_norm['검거'] = crime_anal_norm['검거'] / tmp_max * 100
crime_anal_norm_sort = crime_anal_norm.sort_values(by='검거', ascending=False)
crime_anal_norm_sort.head()

	강간	강도	살인	절도	폭력	강간검거율	강도검거율	살인검거율	절도검거율	폭력검거율	인구수	CCTV	범죄	검거
구별
도봉구	0.000000	0.235294	0.083333	0.000000	0.000000	100.000000	100.0	100.0	44.967074	87.626093	348646.0	485	0.318627	100.000000
금천구	0.141210	0.058824	0.083333	0.172426	0.134074	80.794702	100.0	100.0	56.668794	86.465433	255082.0	1015	0.589867	97.997139
광진구	0.397695	0.529412	0.166667	0.671570	0.269094	91.666667	100.0	100.0	42.200925	83.047619	372164.0	707	2.034438	96.375820
동대문구	0.204611	0.470588	0.250000	0.314061	0.250887	84.393064	100.0	100.0	41.090358	87.401884	369496.0	1294	1.490147	95.444250
용산구	0.265130	0.529412	0.250000	0.169004	0.133128	89.175258	100.0	100.0	37.700706	83.121951	244203.0	1624	1.346674	94.776790
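
The rescaling step above can be sketched on a tiny table. The column name `arrest` and the three districts are hypothetical stand-ins for the '검거' column and the real index; only the technique (divide by the column maximum, multiply by 100, then sort) comes from the post.

```python
import pandas as pd

# Hypothetical summed arrest rates for three districts (not the real data)
df = pd.DataFrame(
    {"arrest": [432.6, 423.9, 416.9]},
    index=pd.Index(["district_a", "district_b", "district_c"], name="district"),
)

# Divide by the column maximum so that the top district scores exactly 100
df["arrest"] = df["arrest"] / df["arrest"].max() * 100

# Sort so that the highest score comes first
df_sorted = df.sort_values(by="arrest", ascending=False)
print(df_sorted)
```

Note that this is a relative score: 100 means "best among these districts", not a 100% arrest rate.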

target_col = ['강간검거율', '강도검거율', '살인검거율', '절도검거율', '폭력검거율']

crime_anal_norm_sort = crime_anal_norm.sort_values(by='검거', ascending=False)

plt.figure(figsize = (10,10))

# The cmap option selects the colormap, i.e. which colors the smallest and largest values are mapped to.
sns.heatmap(crime_anal_norm_sort[target_col], annot=True, fmt='f', 
                    linewidths=.5, cmap='RdPu')
plt.title('범죄 검거 비율 (정규화된 검거의 합으로 정렬)')
plt.show()

target_col = ['강간', '강도', '살인', '절도', '폭력', '범죄']

# Divide by 5 because there are five crime types; this turns the sum into an average.
crime_anal_norm['범죄'] = crime_anal_norm['범죄'] / 5
crime_anal_norm_sort = crime_anal_norm.sort_values(by='범죄', ascending=False)

plt.figure(figsize = (10,10))
sns.heatmap(crime_anal_norm_sort[target_col], annot=True, fmt='f', linewidths=.5,
                       cmap='RdPu')
plt.title('범죄비율 (정규화된 발생 건수로 정렬)')
plt.show()
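
As a small check of the division by 5: the '범죄' column is the sum of the five normalized crime columns, so dividing it by 5 gives the row-wise mean. The sketch below uses hypothetical English column names and made-up values in place of the real data.

```python
import pandas as pd

# Stand-ins for the five crime columns (강간, 강도, 살인, 절도, 폭력)
crime_cols = ["rape", "robbery", "murder", "theft", "violence"]

# Hypothetical min-max normalized values, each in [0, 1]
df = pd.DataFrame(
    [[1.0, 0.5, 0.0, 0.25, 0.75],
     [0.0, 0.0, 1.0, 0.5, 0.5]],
    index=["district_a", "district_b"],
    columns=crime_cols,
)

total = df[crime_cols].sum(axis=1)   # what the summed column holds before the division
average = total / len(crime_cols)    # what it holds afterwards
print(average)
```

One thing to watch for in a notebook: running the dividing cell twice divides the column twice.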

# Save the data to a CSV file with to_csv.
crime_anal_norm.to_csv('../data/02. crime_in_Seoul_final.csv', sep=',', 
                       encoding='utf-8')
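
When the saved file is read back later, the district names come back as an ordinary first column unless the index is restored. A minimal round-trip sketch, using an in-memory buffer instead of a file and a hypothetical two-row table:

```python
import io
import pandas as pd

df = pd.DataFrame(
    {"crime": [0.41, 0.12]},
    index=pd.Index(["district_a", "district_b"], name="district"),
)

buffer = io.StringIO()        # in-memory stand-in for the CSV file
df.to_csv(buffer, sep=",")    # the index is written as the first column
buffer.seek(0)

# index_col=0 turns the first column back into the index
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```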

Folium

import folium
# Import folium, a map-visualization library for Python.

# folium provides a Map class: give it a [latitude, longitude] pair
# and it displays a map centered on that location.
map_osm = folium.Map(location=[45.5236, -122.6750])
map_osm

# The zoom_start option sets the initial zoom level of the map.
stamen = folium.Map(location=[45.5236, -122.6750], zoom_start=13)
stamen

# The tiles option selects the tile set (visual theme) of the map.
# folium documentation:
# https://python-visualization.github.io/folium/modules.html
stamen = folium.Map(location=[45.5236, -122.6750], tiles='Stamen Toner', 
                    zoom_start=13)
stamen

stamen = folium.Map(location=[45.5236, -122.6750], 
                    tiles='Stamen Terrain', zoom_start=13)
stamen

map_1 = folium.Map(location=[45.372, -121.6972], zoom_start=12,
                   tiles='Stamen Terrain')
# Add markers to the map.
# popup is the text shown when the user clicks the marker.
# icon selects which icon to draw.
# .add_to() specifies which map the marker is added to.
folium.Marker([45.3288, -121.6625], popup='Mt. Hood Meadows', 
              icon=folium.Icon(icon='cloud')).add_to(map_1)
folium.Marker([45.3311, -121.7113], popup='Timberline Lodge', 
              icon=folium.Icon(icon='cloud')).add_to(map_1)
map_1

map_1 = folium.Map(location=[45.372, -121.6972], zoom_start=12, 
                   tiles='Stamen Terrain')
folium.Marker([45.3288, -121.6625], popup='Mt. Hood Meadows', 
              icon=folium.Icon(icon='cloud')).add_to(map_1)
folium.Marker([45.3311, -121.7113], popup='Timberline Lodge', 
              icon=folium.Icon(color='green')).add_to(map_1)
folium.Marker([45.3300, -121.6823], popup='Some Other Location', 
              icon=folium.Icon(color='red',icon='info-sign')).add_to(map_1)
map_1

map_2 = folium.Map(location=[45.5236, -122.6750], tiles='Stamen Toner', 
                   zoom_start=13)
folium.Marker([45.5244, -122.6699], popup='The Waterfront' ).add_to(map_2)

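# CircleMarker marks a location with a circle.
# Since folium 0.4.0, fill=True is required for the circle to be filled with fill_color.
folium.CircleMarker([45.5215, -122.6261], radius=50, 
                    popup='Laurelhurst Park', color='#3186cc', 
                    fill=True, fill_color='#3186cc').add_to(map_2)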
map_2

map_5 = folium.Map(location=[45.5236, -122.6750], zoom_start=13)

# Markers can also be regular polygons instead of circles.
# The number_of_sides option sets how many sides the polygon has.
folium.RegularPolygonMarker([45.5012, -122.6655], 
                            popup='Ross Island Bridge', fill_color='#132b5e', 
                            number_of_sides=3, radius=10).add_to(map_5)
folium.RegularPolygonMarker([45.5132, -122.6708], 
                            popup='Hawthorne Bridge', fill_color='#45647d', 
                            number_of_sides=4, radius=10).add_to(map_5)
folium.RegularPolygonMarker([45.5275, -122.6692], 
                            popup='Steel Bridge', fill_color='#769d96', 
                            number_of_sides=6, radius=10).add_to(map_5)
folium.RegularPolygonMarker([45.5318, -122.6745], 
                            popup='Broadway Bridge', fill_color='#769d96', 
                            number_of_sides=8, radius=10).add_to(map_5)
map_5

import folium
import pandas as pd

# Relative path to the dataset with US unemployment rates.
state_unemployment = '../data/02. folium_US_Unemployment_Oct2012.csv'

# Read the US unemployment data into the state_data variable.
state_data = pd.read_csv(state_unemployment)
state_data.head()

	State	Unemployment
0	AL	7.1
1	AK	6.8
2	AZ	8.1
3	AR	7.2
4	CA	10.1

# Relative path to the JSON file with the US state boundaries (once loaded, JSON maps onto a Python dictionary).
state_geo = '../data/02. folium_us-states.json'

# Create a map centered at latitude 40, longitude -98 with zoom level 4 and store it in the variable map (this name shadows Python's built-in map).
map = folium.Map(location=[40, -98], zoom_start=4)

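# Pass the region boundaries (state_geo) as geo_data
# and the per-state unemployment figures (state_data) as data.
# The columns used are State and Unemployment from state_data.
# key_on names the GeoJSON field used as the key when joining the two datasets.
# fill_color picks the color scale used to show low versus high values.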
map.choropleth(geo_data=state_geo, data=state_data,
             columns=['State', 'Unemployment'],
             key_on='feature.id',
             fill_color='YlGn',
             legend_name='Unemployment Rate (%)')
map

C:\Users\one\AppData\Local\Continuum\anaconda3\lib\site-packages\folium\folium.py:415: FutureWarning: The choropleth  method has been deprecated. Instead use the new Choropleth class, which has the same arguments. See the example notebook 'GeoJSON_and_choropleth' for how to do this.
  FutureWarning
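
To make the role of `key_on='feature.id'` concrete, here is a plain-Python sketch of the join it describes. The `geo` structure is a hand-written, geometry-free stand-in for the real GeoJSON file, and the lookup is only an illustration of the idea, not folium's actual implementation.

```python
# Tiny GeoJSON-like structure; real files also carry a geometry per feature
geo = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "id": "AL", "properties": {"name": "Alabama"}},
        {"type": "Feature", "id": "AK", "properties": {"name": "Alaska"}},
    ],
}

# Values keyed the same way as the State column of state_data
unemployment = {"AL": 7.1, "AK": 6.8}

# key_on='feature.id' means: for each feature, look up feature['id'] in the data
joined = {
    feature["properties"]["name"]: unemployment.get(feature["id"])
    for feature in geo["features"]
}
print(joined)
```

If the ids in the GeoJSON and the keys in the data do not match exactly, the lookup finds nothing for that region, which is the usual reason a region shows up uncolored.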

Visualizing crime rates on a map

import json

# Relative path to the GeoJSON data with South Korea's municipal boundaries.
geo_path = '../data/02. skorea_municipalities_geo_simple.json'

# Use json.load() to read the JSON file into a Python dictionary.
# JSON looks like a dictionary but is stored as text, so it must be parsed first.
with open(geo_path, encoding='utf-8') as f:
    geo_str = json.load(f)
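
The difference between JSON text and the dictionary it becomes can be seen with the standard library alone. In this sketch the string `text` is a made-up, empty feature collection: `json.loads` parses a string, while `json.load` parses a file-like object.

```python
import io
import json

text = '{"type": "FeatureCollection", "features": []}'  # JSON is just text

from_string = json.loads(text)            # parse a str
from_file = json.load(io.StringIO(text))  # parse a file-like object

print(type(from_string).__name__, from_string["type"])
```

Both calls produce the same dictionary, which is the kind of object passed to `geo_data` here.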

map = folium.Map(location=[37.5502, 126.982], zoom_start=11, 
                 tiles='Stamen Toner')

map.choropleth(geo_data = geo_str,
               data = crime_anal_norm['살인'],
               columns = [crime_anal_norm.index, crime_anal_norm['살인']],
               fill_color = 'PuRd', #PuRd, YlGnBu
               key_on = 'feature.id')
map

map = folium.Map(location=[37.5502, 126.982], zoom_start=11, 
                 tiles='Stamen Toner')

map.choropleth(geo_data = geo_str,
               data = crime_anal_norm['강간'],
               columns = [crime_anal_norm.index, crime_anal_norm['강간']],
               fill_color = 'PuRd', #PuRd, YlGnBu
               key_on = 'feature.id')
map

map = folium.Map(location=[37.5502, 126.982], zoom_start=11, 
                 tiles='Stamen Toner')

map.choropleth(geo_data = geo_str,
               data = crime_anal_norm['범죄'],
               columns = [crime_anal_norm.index, crime_anal_norm['범죄']],
               fill_color = 'PuRd', #PuRd, YlGnBu
               key_on = 'feature.id')
map

tmp_criminal = crime_anal_norm['살인'] /  crime_anal_norm['인구수'] * 1000000

map = folium.Map(location=[37.5502, 126.982], zoom_start=11, 
                 tiles='Stamen Toner')

map.choropleth(geo_data = geo_str,
               data = tmp_criminal,
               columns = [crime_anal.index, tmp_criminal],
               fill_color = 'PuRd', #PuRd, YlGnBu
               key_on = 'feature.id')
map
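
The per-capita figure above is a division of two Series followed by a scale factor of 1,000,000. A minimal sketch with hypothetical counts and populations shows that pandas aligns the two Series on their index labels, not on position. Keep in mind that in this post the '살인' column is already normalized, so the mapped value is a relative figure rather than a literal count per million residents.

```python
import pandas as pd

# Hypothetical counts and populations, indexed by district name
murders = pd.Series({"district_a": 3, "district_b": 12})
population = pd.Series({"district_b": 600_000, "district_a": 300_000})

# Division aligns on the index labels, so the differing order is harmless
per_million = murders / population * 1_000_000
print(per_million)
```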

tmp_criminal = crime_anal_norm['범죄'] /  crime_anal_norm['인구수'] * 1000000

map = folium.Map(location=[37.5502, 126.982], zoom_start=11, 
                 tiles='Stamen Toner')

map.choropleth(geo_data = geo_str,
               data = tmp_criminal,
               columns = [crime_anal.index, tmp_criminal],
               fill_color = 'PuRd', #PuRd, YlGnBu
               key_on = 'feature.id')
map

map = folium.Map(location=[37.5502, 126.982], zoom_start=11, 
                 tiles='Stamen Toner')

map.choropleth(geo_data = geo_str,
               data = crime_anal_norm['검거'],
               columns = [crime_anal_norm.index, crime_anal_norm['검거']],
               fill_color = 'YlGnBu', #PuRd, YlGnBu
               key_on = 'feature.id')
map
