Python | An Introduction to pandas (a Powerful Tool for Working with Tabular Data)

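pandas is a third-party library that you will almost certainly use when doing data analysis in Python, so it is well worth getting to know. This article is a translation of the overview section of the official pandas documentation. When learning to program, reading and using the official documentation is the most direct and most reliable way to solve problems. If you are serious about programming, make a habit of reading the official documentation; otherwise you will only ever consume what others have already digested for you.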

Official pandas documentation: http://pandas.pydata.org/pandas-docs/stable/

Note: my English is limited, so mistakes are likely. If you spot one, please leave a comment so I can correct it.

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.

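Put another way: pandas is a third-party Python library that provides fast, flexible, and expressive data structures, which make it easier and more intuitive to work with relational and labeled data. It aims to be the basic high-level building block for practical, real-world data analysis in Python. Beyond that, it has a more ambitious goal: to become the most powerful and most flexible open source tool for data analysis and manipulation in any programming language. It is already making steady progress toward that goal.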

pandas is well suited for many different kinds of data:

  • Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
  • Ordered and unordered (not necessarily fixed-frequency) time series data.
  • Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
  • Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

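pandas is suited to many different kinds of data:

  • Tabular data with heterogeneously typed columns (the columns have different data types, for example strings in one, numbers in another, and dates or times in a third), such as an SQL table or an Excel spreadsheet;
  • Ordered and unordered time series data (not necessarily at a fixed frequency);
  • Arbitrary matrix data with row and column labels (homogeneous when every column has the same data type, heterogeneous otherwise);
  • Any other form of observational or statistical data set. The data does not need to be labeled at all to be stored in a pandas data structure.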
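The four kinds of data listed above can be sketched as follows. This is only an illustration: the variable names and sample values are made up and do not come from the pandas documentation.

```python
import numpy as np
import pandas as pd

# 1. Tabular data with heterogeneously typed columns (like an SQL table).
table = pd.DataFrame(
    {
        "name": ["alice", "bob", "carol"],  # strings
        "score": [91.5, 78.0, 85.25],       # floats
        "joined": pd.to_datetime(["2020-01-05", "2020-03-17", "2021-07-01"]),
    }
)

# 2. Time series data; the timestamps do not have to be evenly spaced.
ts = pd.Series(
    [1.0, 2.5, 4.0],
    index=pd.to_datetime(["2021-01-01", "2021-01-04", "2021-02-15"]),
)

# 3. Matrix data with row and column labels.
matrix = pd.DataFrame(
    np.arange(6).reshape(2, 3),
    index=["r1", "r2"],
    columns=["a", "b", "c"],
)

# 4. Unlabeled data: pandas falls back to a default integer index.
unlabeled = pd.Series([10, 20, 30])

print(table.dtypes)
print(matrix)
```

Printing `table.dtypes` shows that each column keeps its own data type, which is what "heterogeneously typed" means here.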

The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For R users, DataFrame provides everything that R’s data.frame provides and much more. pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.

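The two most important data structures in pandas are Series (one-dimensional) and DataFrame (two-dimensional). Between them they cover most data processing needs in finance, statistics, social science, and many areas of engineering. For R users, DataFrame provides everything that R's data.frame provides, plus features that data.frame lacks. pandas is built on top of NumPy and is designed to integrate well with the many other third-party libraries found in a scientific computing environment.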
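As a minimal sketch of how these two structures relate to each other and to NumPy (the labels and values below are invented for illustration):

```python
import numpy as np
import pandas as pd

# A Series is a 1-dimensional labeled array.
prices = pd.Series([10.0, 12.5, 11.0], index=["mon", "tue", "wed"], name="price")

# A DataFrame is a 2-dimensional labeled table; each column is a Series.
frame = pd.DataFrame({"price": prices, "volume": [100, 150, 120]})

# Selecting a single column gives back a Series.
column = frame["price"]

# The underlying values are available as a NumPy array.
values = frame["price"].to_numpy()

# NumPy functions can be applied to pandas objects; the labels are kept.
log_prices = np.log(prices)

print(frame)
print(type(column).__name__, type(values).__name__)
```

Because the data can be exposed as NumPy arrays, pandas objects can usually be passed to other numerical libraries with little conversion work.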

Here are just a few of the things that pandas does well:

  • Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
  • Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
  • Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
  • Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
  • Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
  • Intuitive merging and joining data sets
  • Flexible reshaping and pivoting of data sets
  • Hierarchical labeling of axes (possible to have multiple labels per tick)
  • Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast HDF5 format
  • Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.

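Here are some of the things pandas is particularly good at:

  • Easy handling of missing values (represented as NaN) in both floating point and non-floating point data;
  • Size mutability: columns can be inserted into and deleted from a DataFrame or a higher-dimensional object;
  • Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, and so on align the data automatically during computations;
  • Powerful, flexible group by functionality for split-apply-combine operations on data sets, for both aggregating and transforming data;
  • Easy conversion of ragged, differently indexed data held in other Python and NumPy data structures into DataFrame objects;
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
  • Intuitive merging and joining of data sets;
  • Flexible reshaping and pivoting of data sets;
  • Hierarchical labeling of axes (each tick can carry multiple labels);
  • Robust IO tools for loading data from flat files (CSV and delimited), Excel files, and databases, and for saving and loading data in the very fast HDF5 format;
  • Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, and so on.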
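A few of the features listed above can be sketched in a short example. It covers only missing data, automatic alignment, group by, merging, and basic time series operations, and all names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Missing data: NaN marks a missing value and is skipped by reductions by default.
s = pd.Series([1.0, np.nan, 3.0])
n_missing = int(s.isna().sum())
filled = s.fillna(0.0)
mean_skipping_nan = s.mean()

# Automatic alignment: arithmetic matches values by label, not by position.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])
total = a + b  # labels found in only one operand produce NaN

# Group by: split the rows by key, apply a function, combine the results.
sales = pd.DataFrame(
    {
        "region": ["north", "south", "north", "south"],
        "amount": [10, 20, 30, 40],
    }
)
per_region = sales.groupby("region")["amount"].sum()

# Merging: join two tables on a shared key column.
managers = pd.DataFrame({"region": ["north", "south"], "manager": ["ann", "ben"]})
merged = sales.merge(managers, on="region", how="left")

# Time series: date range generation, moving window statistics, shifting.
dates = pd.date_range("2021-01-01", periods=6, freq="D")
daily = pd.Series(range(6), index=dates)
rolling_mean = daily.rolling(window=3).mean()
lagged = daily.shift(1)

print(total)
print(per_region)
print(merged)
```

In `total`, only the labels `y` and `z` appear in both inputs, so only those two positions hold numbers; `x` and `w` come out as NaN.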

Many of these principles are here to address the shortcomings frequently experienced using other languages / scientific research environments. For data scientists, working with data is typically divided into multiple stages: munging and cleaning data, analyzing / modeling it, then organizing the results of the analysis into a form suitable for plotting or tabular display. pandas is the ideal tool for all of these tasks.

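Many of the pandas features mentioned above exist to address shortcomings that people often run into in other languages and scientific research environments. For data scientists, the work usually falls into several stages: cleaning and tidying the data, analyzing and modeling it, and then organizing the results into charts or tables for presentation. pandas is an ideal tool for all of these tasks. (In other words, pandas is useful at every stage, from cleaning the data to presenting the final results.)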
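The stages described above might look like this in practice. The sketch below uses a small made-up CSV held in memory; the column names `city` and `temp_c` are hypothetical.

```python
import io

import pandas as pd

# Hypothetical raw input with an inconsistent city name and a missing reading.
raw_csv = io.StringIO(
    "city,temp_c\n"
    "Taipei,30.5\n"
    "taipei ,\n"
    "Tokyo,25.0\n"
    "Tokyo,27.0\n"
)

# Stage 1: munging and cleaning.
df = pd.read_csv(raw_csv)
df["city"] = df["city"].str.strip().str.title()  # normalize the city names
df = df.dropna(subset=["temp_c"])                # drop rows with no reading

# Stage 2: analysis.
summary = df.groupby("city")["temp_c"].agg(["mean", "max"])

# Stage 3: organize the result into a form suitable for tabular display.
report = summary.round(1).reset_index()
print(report.to_string(index=False))
```

The same `report` frame could also be handed to a plotting library; that step is left out here to keep the example free of extra dependencies.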

Some other notes

  • pandas is fast. Many of the low-level algorithmic bits have been extensively tweaked in Cython code. However, as with anything else generalization usually sacrifices performance. So if you focus on one feature for your application you may be able to create a faster specialized tool.
  • pandas is a dependency of statsmodels, making it an important part of the statistical computing ecosystem in Python.
  • pandas has been used extensively in production in financial applications.

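Some other notes:

  • pandas is fast. Many of the low-level algorithmic pieces have been extensively tuned in Cython. However, as with any general-purpose tool, generality usually costs some performance, so if your application focuses on a single feature you may be able to build a faster, specialized tool. (In other words, pandas trades some speed for breadth: it is built to handle general data processing work rather than one specific pipeline, so its performance is roughly what you see today. If your processing needs are fixed, you could consider writing a tool in C or C++ that is faster but fits only that scenario.)
  • pandas is a dependency of statsmodels, which makes it an important part of the statistical computing ecosystem in Python;
  • pandas has been used extensively in production in financial applications.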

Note: This documentation assumes general familiarity with NumPy. If you haven’t used NumPy much or at all, do invest some time in learning about NumPy first.

Note: the pandas documentation assumes you are already familiar with NumPy. If you have not used NumPy before, spend some time learning it first.
