Pandas tutorial: Indexing and Selecting Data

.loc

.loc is primarily label based, but may also be used with a boolean array. It raises KeyError when the requested labels are not found.

The .loc attribute is the primary access method. The following are valid inputs:

  • A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index, not as an integer position along the index)
  • A list or array of labels ['a', 'b', 'c']
  • A slice object with labels 'a':'f' (note that, contrary to usual Python slices, both the start and the stop are included when present in the index)
  • A boolean array
  • A callable
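The input kinds listed above can be sketched on a small frame. The frame, its labels and the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with string row labels.
df = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]},
    index=["a", "b", "c", "d"],
)

single = df.loc["b"]                     # single label: one row as a Series
several = df.loc[["a", "c"]]             # list of labels: a DataFrame
sliced = df.loc["b":"d"]                 # label slice: both endpoints included
masked = df.loc[df["A"] > 2]             # boolean array
called = df.loc[lambda d: d["B"] < 30]   # callable returning a boolean Series

# A label that is not in the index raises KeyError.
missing = False
try:
    df.loc["z"]
except KeyError:
    missing = True
```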

NOTE

This will not modify df, because the columns are aligned by label before the values are assigned:

df.loc[:,['B', 'A']] = df[['A', 'B']]

The correct way is to use the raw values:

df.loc[:,['B', 'A']] = df[['A', 'B']].values

or simply:

df[['B', 'A']] = df[['A', 'B']]
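A minimal sketch of the difference, using a made-up two-column frame. It uses to_numpy() where the note above uses .values; both hand .loc a plain array with no column labels to align on.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# Right-hand side is a DataFrame: columns are matched by label first,
# so A is assigned from A and B from B, and nothing changes.
aligned = df.copy()
aligned.loc[:, ["B", "A"]] = aligned[["A", "B"]]

# Right-hand side is a raw array: values are assigned by position,
# so the two columns are swapped.
swapped = df.copy()
swapped.loc[:, ["B", "A"]] = swapped[["A", "B"]].to_numpy()
```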

.iloc

.iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array.

You can also assign a dict to a row of a DataFrame:

In [28]: x = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 4, 5]})

In [29]: x.iloc[1] = dict(x=9, y=99)

In [30]: x
Out[30]: 
   x   y
0  1   3
1  9  99
2  3   5

Allowed inputs are:

  • An integer

  • A list or array of integers [4, 3, 0]

  • A slice object with ints 1:7

  • A boolean array

  • A callable with one argument (the calling Series or DataFrame; Panel was removed in pandas 0.25) that returns valid output for indexing (one of the above)
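As a sketch, each of the allowed inputs applied to a small illustrative frame (the data and names are assumptions). The row labels are strings to make it clear that .iloc ignores them.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]},
    index=["a", "b", "c", "d"],
)

row = df.iloc[1]                               # integer: second row as a Series
picked = df.iloc[[3, 0]]                       # list of positions, order kept
window = df.iloc[1:3]                          # positional slice, stop excluded
masked = df.iloc[[True, False, True, False]]   # boolean list
called = df.iloc[lambda d: [0, 1]]             # callable returning positions
cell = df.iloc[2, 1]                           # row position 2, column position 1
```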

slicing

  • []

With Series, the syntax works exactly as with an ndarray, returning a slice of the values and the corresponding labels.

With DataFrame, slicing inside of [] slices the rows.

[start:end:step]
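A short sketch of [] slicing on both structures, using made-up data. Integer slices here are positional, so the stop position is excluded.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=list("abcde"))
df = pd.DataFrame({"A": range(5), "B": range(5, 10)}, index=list("abcde"))

head = s[:2]            # first two values, labels come along
every_other = s[::2]    # step of 2
middle = df[1:4]        # on a DataFrame, [] with a slice selects rows
reversed_rows = df[::-1]
```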

Fast scalar value getting and setting

If you only want to access a scalar value, the fastest way is to use the at and iat accessors, which are available on both Series and DataFrame.

Similar to loc, at provides label-based scalar lookups, while iat provides integer-based lookups, analogous to iloc.
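A minimal sketch of scalar getting and setting with both accessors. The frame is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]}, index=["x", "y", "z"])

by_label = df.at["y", "B"]     # row label "y", column label "B"
by_position = df.iat[1, 1]     # row position 1, column position 1

df.at["z", "A"] = 99           # set by label
df.iat[0, 1] = -1              # set by position
```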

Boolean indexing

Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not. Each condition must be wrapped in parentheses, because | and & bind more tightly than comparison operators such as < and ==.
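The three operators can be sketched as follows. The column names and values are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"A": [-2, -1, 0, 1, 2], "B": [5, 4, 3, 2, 1]})

both = df[(df["A"] > 0) & (df["B"] < 3)]      # and
either = df[(df["A"] < -1) | (df["B"] == 3)]  # or
negated = df[~(df["A"] > 0)]                  # not
```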

isin

Consider the isin method of Series, which returns a boolean vector that is True wherever the Series elements exist in the passed list. This allows you to select rows where one or more columns have the values you want.
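A small sketch of isin as a row filter, on illustrative data. Masks from several columns can be combined with &.

```python
import pandas as pd

df = pd.DataFrame({"vals": [1, 2, 3, 4], "ids": ["a", "b", "f", "n"]})

mask = df["ids"].isin(["a", "b"])    # boolean Series, one entry per row
selected = df[mask]

# Conditions on more than one column.
multi = df[df["ids"].isin(["a", "n"]) & df["vals"].isin([1, 3])]
```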

Sample

where()

Selecting values from a Series with a boolean vector generally returns a subset of the data. To guarantee that selection output has the same shape as the original data, you can use the where method in Series and DataFrame.

Selecting values from a DataFrame with a boolean criterion, such as df[df < 0], also preserves the input data shape, because where is used under the hood as the implementation. The equivalent call is df.where(df < 0).
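A sketch contrasting plain boolean selection with where, on made-up data. The second positional argument of where is the replacement value used where the condition is False.

```python
import pandas as pd

s = pd.Series([-2, -1, 0, 1, 2])

subset = s[s > 0]              # only the matching elements
same_shape = s.where(s > 0)    # same length, NaN where the condition is False
filled = s.where(s > 0, 0)     # same length, 0 where the condition is False

df = pd.DataFrame({"A": [-1, 2], "B": [3, -4]})
via_getitem = df[df < 0]
via_where = df.where(df < 0)
```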

mask

mask is the inverse boolean operation of where: it replaces values where the condition is True rather than where it is False.
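A minimal sketch of the relationship, assuming a small integer Series: masking on a condition gives the same result as where on the negated condition.

```python
import pandas as pd

s = pd.Series([-2, -1, 0, 1, 2])

masked = s.mask(s > 0)       # NaN where s > 0
kept = s.where(~(s > 0))     # keep values where s > 0 is False
```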

Duplicate Data

  • duplicated

  • drop_duplicates
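As a sketch of the two methods on an illustrative frame: duplicated flags repeated rows, and drop_duplicates removes them. The subset and keep arguments narrow the comparison to some columns and choose which occurrence survives.

```python
import pandas as pd

df = pd.DataFrame({"a": ["x", "x", "y", "x"], "b": [1, 1, 2, 3]})

flags = df.duplicated()               # True for rows that repeat an earlier row
unique_rows = df.drop_duplicates()    # keeps the first occurrence by default

# Compare on column "a" only and keep the last occurrence of each value.
last_per_a = df.drop_duplicates(subset="a", keep="last")
```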

Dictionary-like get() method
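A short sketch of get on a Series and a DataFrame, with made-up labels. As with a dict, a missing key returns the default (None unless given) instead of raising.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])
df = pd.DataFrame({"A": [1, 2]})

found = s.get("a")
fallback = s.get("x", default=-1)   # label not present: default returned
no_column = df.get("missing")       # column not present: None
```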

lookup() (deprecated in pandas 1.2 and removed in 2.0)

index object

set_index()

Reset the index

As a convenience, DataFrame provides reset_index, which transfers the index values into the DataFrame’s columns and sets a simple integer index. This is the inverse operation of set_index.
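A minimal round trip, assuming a frame with a hypothetical "key" column: set_index moves the column into the index, and reset_index moves it back.

```python
import pandas as pd

df = pd.DataFrame({"key": ["k1", "k2", "k3"], "val": [1, 2, 3]})

indexed = df.set_index("key")       # "key" becomes the index
restored = indexed.reset_index()    # "key" is a column again, index is 0..n-1
```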

Returning a view versus a copy

Chained indexing
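A sketch of why chained indexing is a concern when assigning, on a made-up frame. The chained form is left as a comment because it may write to a temporary copy, and pandas typically warns about it. The single .loc call is the form the pandas documentation recommends.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# Chained form: two separate indexing steps. The assignment may land on a
# temporary copy instead of df, so it is shown only as a comment:
#     df[df["A"] > 1]["B"] = 0

# Single .loc call: row mask and column label are resolved in one step,
# so the assignment reaches df itself.
df.loc[df["A"] > 1, "B"] = 0
```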

MultiIndex (hierarchical index)

You can think of MultiIndex as an array of tuples where each tuple is unique.

All of the MultiIndex constructors accept a names argument which stores string names for the levels themselves. If no names are provided, None will be assigned.
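The constructors and the names argument can be sketched as follows. The labels and level names are illustrative.

```python
import pandas as pd

arrays = [["bar", "bar", "baz", "baz"], ["one", "two", "one", "two"]]

from_arrays = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])
from_tuples = pd.MultiIndex.from_tuples(list(zip(*arrays)))   # no names given
from_product = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
```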

the level labels

The method get_level_values will return a vector of the labels for each location at a particular level.

One of the important features of hierarchical indexing is that you can select data by a “partial” label identifying a subgroup in the data. Partial selection “drops” levels of the hierarchical index in the result in a completely analogous way to selecting a column in a regular DataFrame.
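A small sketch of get_level_values and partial selection on a two-level Series. The data and level names are assumptions for the example.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
s = pd.Series([1, 2, 3, 4], index=index)

first_labels = s.index.get_level_values("first")   # one label per row

# Partial label: selects the "bar" subgroup and drops the "first" level.
bar = s.loc["bar"]
```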

Data alignment and using reindex

Using slicers

pd.IndexSlice
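A minimal sketch of slicing a MultiIndex with pd.IndexSlice, assuming a sorted two-level index with made-up level names. Label slices across levels generally need the index to be sorted.

```python
import pandas as pd

idx = pd.IndexSlice
index = pd.MultiIndex.from_product(
    [["a", "b"], [1, 2, 3]], names=["outer", "inner"]
)
df = pd.DataFrame({"v": range(6)}, index=index)

# Every outer label, inner labels 2 through 3 (both ends included).
inner_slice = df.loc[idx[:, 2:3], :]

# Outer label "a", every inner label.
outer_a = df.loc[idx["a", :], :]
```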

Cross-section

The xs method of DataFrame additionally takes a level argument to make selecting data at a particular level of a MultiIndex easier.
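As a sketch of xs with and without the level argument, using an illustrative frame. The selected level is dropped from the result by default.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["a", "b"], [1, 2, 3]], names=["outer", "inner"]
)
df = pd.DataFrame({"v": range(6)}, index=index)

outer_a = df.xs("a")                 # cross-section on the first level
inner_2 = df.xs(2, level="inner")    # cross-section on a named inner level
```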

Alignment

  • align

Swapping levels

The swaplevel method switches the order of two levels.

Reordering levels

The reorder_levels method generalizes swaplevel, allowing you to permute the hierarchical index levels in one step.
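A sketch of both methods on a three-level index. The level names l0, l1 and l2 are hypothetical. Neither call sorts or changes the data; only the order of the levels within each index tuple changes.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["a", "b"], [1, 2], ["x", "y"]], names=["l0", "l1", "l2"]
)
df = pd.DataFrame({"v": range(8)}, index=index)

swapped = df.swaplevel(0, 2)                        # exchange two levels
reordered = df.reorder_levels(["l1", "l2", "l0"])   # arbitrary permutation
```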

Sorting a MultiIndex

.sort_index

The is_lexsorted() method on a MultiIndex shows whether the index is sorted, and the lexsort_depth property returns the sort depth. Both were deprecated in pandas 1.3 in favor of is_monotonic_increasing.
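A minimal sketch of sorting a MultiIndex with sort_index, on made-up tuples. Because is_lexsorted() is not available in recent pandas releases, the sketch checks sortedness with is_monotonic_increasing instead.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([("b", 2), ("a", 1), ("b", 1), ("a", 2)])
s = pd.Series(range(4), index=index)

before = s.index.is_monotonic_increasing    # unsorted index
sorted_s = s.sort_index()                   # sorts by all levels, in order
after = sorted_s.index.is_monotonic_increasing
```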

Take Methods

.take
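A short sketch of take on an Index, a Series and a DataFrame, with illustrative data. It selects by integer position, in the order given.

```python
import pandas as pd

idx = pd.Index(["a", "b", "c"])
s = pd.Series([10, 20, 30, 40], index=list("abcd"))
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=list("xyz"))

idx_taken = idx.take([2, 0])
s_taken = s.take([3, 0])
rows_taken = df.take([2, 0])          # rows by position
cols_taken = df.take([1], axis=1)     # columns by position
```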

Index Types

  • MultiIndex
  • DatetimeIndex and PeriodIndex
  • TimedeltaIndex
  • CategoricalIndex
  • Int64Index and RangeIndex
  • Float64Index
  • IntervalIndex

