Complete Guide to NumPy | Python for Data Science

Updated: May 25



NumPy (Numerical Python) is the first step in your data science journey and a fundamental package for scientific computing with Python. What makes NumPy special is that it brings the computational power of languages like C and Fortran to Python, and because of this almost every scientist working in Python uses this package.


It is widely used not only in data science and machine learning but also in other scientific domains such as quantum computing, image processing, and bioinformatics.

The figure below shows different Python packages that use NumPy as a base package to perform various numerical operations.














Some of the widely used python libraries and packages in the data science and machine learning world are also based on the NumPy package.




---------------------------------------------------------------------------------------------------------------------------

Index :

1. How to install NumPy?

2. What is NumPy?

3. Why is NumPy Fast?

4. How to import NumPy?

5. What’s the difference between a Python list and a NumPy array?

6. What is an array?

7. How to create an array in NumPy?

8. NumPy Fundamentals


Q1. How to install NumPy?


Ans: You only need Python, or a Python environment, to install the NumPy package. If you don’t have Python yet and want the simplest way to get started, the Anaconda Distribution is a good choice: it is a Python distribution (not an IDE) that includes Python, NumPy, and many other commonly used packages for scientific computing and data science, along with tools such as Jupyter Notebook, JupyterLab, and Spyder.


CONDA

# Best practice, use an environment rather than install in the base env
conda create -n my-env
conda activate my-env
# If you want to install from conda-forge
conda config --env --add channels conda-forge
# The actual install command
conda install numpy

PIP

pip install numpy
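
Once either command has finished, a quick way to confirm that the install worked is to import the package and print its version. This is just a minimal sanity check, not part of the official install steps, and the version string you see will depend on your environment.

```python
# Minimal sanity check; run it inside the environment you installed into.
import numpy as np

# The exact version string depends on what conda or pip resolved.
print(np.__version__)

# A tiny computation confirms that the compiled core loads correctly.
total = int(np.arange(5).sum())  # 0 + 1 + 2 + 3 + 4
print(total)
```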


Q2. What is NumPy?


NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object. At the core of the NumPy package is the ndarray object, which encapsulates n-dimensional arrays of homogeneous data types, with many operations being performed in compiled code for performance. There are several important differences between NumPy arrays and the standard Python sequences.

  • NumPy arrays have a fixed size at creation, unlike Python lists (which can grow dynamically). Changing the size of an ndarray will create a new array and delete the original.

  • The elements in a NumPy array are all required to be of the same data type, and thus will be the same size in memory. The exception: one can have arrays of (Python, including NumPy) objects, thereby allowing for arrays of different sized elements.

  • NumPy arrays facilitate advanced mathematical and other types of operations on large numbers of data. Typically, such operations are executed more efficiently and with less code than is possible using Python’s built-in sequences.
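
The first two points above can be seen in a few lines. This is a small illustrative sketch; the variable names are made up for the example.

```python
import numpy as np

a = np.array([1, 2, 3])

# "Growing" an array does not modify it in place: np.append returns a new array.
b = np.append(a, 4)
print(a.size, b.size)  # a is unchanged; b is a separate, larger array

# Mixed numeric inputs are converted to one common dtype (the integers become floats).
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)

# The exception mentioned above: dtype=object stores arbitrary Python objects.
objs = np.array([1, "two", 3.0], dtype=object)
print(objs.dtype)
```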

For example:

# Double each element of a list in plain Python
a = [1,2,3,4]
double = [x*2 for x in a]
print(double)
Output: [2, 4, 6, 8]

Using the NumPy package:


# Double each element using NumPy
import numpy as np
a = [1,2,3,4]
double = np.array(a) * 2
print(double)
Output: [2 4 6 8]


Q3. Why is NumPy Fast? (Important interview question)


The simple answer is Vectorization. In mathematics, a vector is something that has magnitude and direction. In programming and computer science, vectorization is the process of applying operations to an entire set of values at once.

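The built-in methods in NumPy and Pandas are implemented in C, which is what makes vectorization possible: the loop over the elements runs in compiled code instead of in the Python interpreter.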


The Mechanism Behind Vectorization — SISD vs SIMD

Modern computer processors contain components that have particular computer architecture classifications that are relevant to understanding vectorization:


SISD — Single Instruction, Single Data

SIMD—Single Instruction, Multiple Data

  • SISD: This is how Python for-loops are processed: one instruction, on one data element, at one moment in time, producing one result. The neat thing about this is that it is flexible: you may implement any operation on your data. The drawback is that it is not optimal for processing large amounts of data.

  • SIMD: This is how NumPy and Pandas vectorized operations are processed: one instruction applied to multiple data elements at one moment in time, producing multiple results. Contemporary CPUs have a component to process SIMD operations in each of their cores, allowing for parallel processing.
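
To get a feel for the difference, you can time a plain Python loop against the equivalent vectorized call. The sketch below uses the standard-library timeit module; the function names and the array size are arbitrary choices for the illustration. The actual timings vary from machine to machine, and whether SIMD instructions are used under the hood depends on your NumPy build and CPU.

```python
import timeit
import numpy as np

values = list(range(100_000))
arr = np.array(values)

def double_with_loop():
    # One Python-level multiplication per element.
    return [x * 2 for x in values]

def double_vectorized():
    # One NumPy call; the loop over the elements runs in compiled code.
    return arr * 2

loop_time = timeit.timeit(double_with_loop, number=20)
vec_time = timeit.timeit(double_vectorized, number=20)

print(f"loop:       {loop_time:.4f} s")
print(f"vectorized: {vec_time:.4f} s")

# Both approaches produce the same values.
same = np.array_equal(np.array(double_with_loop()), double_vectorized())
print(same)
```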

Want to know the complete comparison between Python, C, and NumPy? Check out this wonderful article👌.



Q4. How to import NumPy?


Importing NumPy is a very straightforward process. To access NumPy and its functions, import it in your Python code like this:

import numpy as np

Note: Alias np is a widely used convention for the NumPy package.


Q5. What’s the difference between a Python list and a NumPy array?

| Python list | NumPy array |
|---|---|
| A Python list can contain different data types within a single list. | The elements in a NumPy array should be homogeneous. |
| Lists are slower than NumPy arrays for numerical operations. | NumPy arrays are faster than lists for numerical operations. |
| The type of a list is always list, irrespective of the elements inside it, and there is no mechanism for specifying the data type of the elements. | NumPy uses much less memory to store data and provides a mechanism for specifying the data type. |
| Broadcasting is not possible. | Broadcasting is possible. |
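
A few of these differences are easy to check yourself. The snippet below is a small sketch with made-up variable names; it shows how the same operator behaves on a list and on an array, a simple case of broadcasting, and an explicitly chosen data type.

```python
import numpy as np

py_list = [1, 2, 3]
np_arr = np.array([1, 2, 3])

# The same operator means different things for the two types.
print(py_list * 2)  # list repetition: [1, 2, 3, 1, 2, 3]
print(np_arr * 2)   # elementwise multiplication

# Broadcasting: a (3,) array combined with a (2, 1) array gives a (2, 3) result.
column = np.array([[10], [20]])
grid = np_arr + column
print(grid.shape)

# A dtype can be chosen explicitly; itemsize is the number of bytes per element.
small = np.array([1, 2, 3], dtype=np.int8)
print(small.dtype, small.itemsize, small.nbytes)
```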


Q6. What is an array?


An array is a central data structure of the NumPy library. An array is a grid of values and it contains information about the raw data, how to locate an element, and how to interpret an element. It has a grid of elements that can be indexed in various ways. The elements are all of the same type, referred to as the array dtype.

An array can be indexed by a tuple of nonnegative integers, by booleans, by another array, or by integers. The rank of the array is the number of dimensions. The shape of the array is a tuple of integers giving the size of the array along each dimension.

One way we can initialize NumPy arrays is from Python lists, using nested lists for two- or higher-dimensional data.

For example:

>>> a = np.array([1, 2, 3, 4, 5, 6])

or:

>>> a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
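
Using the two-dimensional array above, here is a short sketch of the terms just introduced (rank, shape, dtype) and of the different ways to index an array.

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

print(a.ndim)   # number of dimensions (the rank): 2
print(a.shape)  # size along each dimension: (3, 4)
print(a.dtype)  # an integer dtype; the exact width is platform dependent

print(a[1, 2])            # tuple of integers -> a single element: 7
print(a[0])               # single integer -> the first row
print(a[a > 10])          # boolean mask -> the elements greater than 10
print(a[[0, 2], [1, 3]])  # integer arrays -> the elements at (0, 1) and (2, 3)
```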

Q7. How to create an array in NumPy?


To create a NumPy array, you can use the function np.array().

All you need to do to create a simple array is pass a list to it. If you choose to, you can also specify the type of data in your list.

>>> import numpy as np
>>> a = np.array([1, 2, 3])

  • Besides creating an array from a sequence of elements, you can easily create an array filled with 0’s:

>>> np.zeros(2)
array([0., 0.])

  • Or an array filled with 1’s:

>>> np.ones(2)
array([1., 1.])

  • You can create an array with a range of elements:

>>> np.arange(4)
array([0, 1, 2, 3])

  • You can also use np.linspace() to create an array with values that are spaced linearly in a specified interval:

>>> np.linspace(0, 10, num=5)
array([ 0. ,  2.5,  5. ,  7.5, 10. ])

  • While the default data type is floating point (np.float64), you can explicitly specify which data type you want using the dtype keyword.

>>> x = np.ones(2, dtype=np.int64)
>>> x
array([1, 1])
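
The dtype keyword also works when you build an array from a list, and np.arange and np.linspace accept a few more arguments than shown above. A short sketch, with illustrative variable names:

```python
import numpy as np

# dtype can be passed to np.array as well as to np.ones / np.zeros.
floats = np.array([1, 2, 3], dtype=np.float64)
print(floats)  # [1. 2. 3.]

# np.arange also accepts start, stop (exclusive) and step.
evens = np.arange(2, 10, 2)
print(evens)   # [2 4 6 8]

# np.linspace includes the end point by default, so num=5 gives 4 equal steps.
points = np.linspace(0, 1, num=5)
print(points)
```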

Q8. NumPy Fundamentals


  1. Array Creation

  2. Array Indexing

  3. Array Broadcasting

  4. Array Applications

Array Creation:

1. Conversion from other Python structures (i.e. lists and tuples)

>>> a1D = np.array([1, 2, 3, 4])
>>> a2D = np.array([[1, 2], [3, 4]])
>>> a3D = np.array([[[1, 2], [3, 4]],[[5, 6], [7, 8]]])

2. Intrinsic NumPy array creation functions (e.g. arange, ones, zeros, etc.)

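For example, np.arange creates a 1D array of regularly spaced values, np.zeros creates an array of zeros with the given shape, and np.eye(n, m) defines a 2D identity matrix, in which the elements where i=j (row index and column index are equal) are 1 and the rest are 0: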

>>> np.arange(10) ## 1D Array
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

>>> np.eye(3) ## 2D Square Matrix (Array)
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
>>> np.eye(3, 5) ## 2D Matrix(Array) with 3 Rows and 5 Columns
array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.]])
>>> np.zeros((2, 3, 2)) ## 3D Array
array([[[0., 0.],
        [0., 0.],
        [0., 0.]],

       [[0., 0.],
        [0., 0.],
        [0., 0.]]])

3. Replicating, joining, or mutating existing arrays

Once you have created arrays, you can replicate, join, or mutate those existing arrays to create new arrays. When you assign an array or its elements to a new variable, you have to explicitly numpy.copy the array, otherwise the variable is a view into the original array.

Consider the following example:

>>> a = np.array([1, 2, 3, 4, 5, 6])
>>> b = a[:2]
>>> b += 1
>>> print('a =', a, '; b =', b)
a = [2 3 3 4 5 6] ; b = [2 3]

Using numpy.copy

>>> a = np.array([1, 2, 3, 4])
>>> b = a[:2].copy()
>>> b += 1
>>> print('a = ', a, 'b = ', b)
a =  [1 2 3 4] b =  [2 3]
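
If you are ever unsure whether a variable is a view or a copy, np.shares_memory can tell you. The sketch below also shows two common ways of joining arrays, np.concatenate and np.vstack; the variable names are only for illustration.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])

view = a[:2]         # basic slicing returns a view
copy = a[:2].copy()  # an independent copy

print(np.shares_memory(a, view))  # True
print(np.shares_memory(a, copy))  # False

# Joining creates new arrays; the inputs are left untouched.
x = np.array([1, 2])
y = np.array([3, 4])
joined = np.concatenate([x, y])  # 1D result: [1 2 3 4]
stacked = np.vstack([x, y])      # 2D result with shape (2, 2)
print(joined, stacked.shape)
```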

4. Reading arrays from disk, either from standard or custom formats

Delimited files such as comma-separated value (CSV) and tab-separated value (TSV) files are used by programs like Excel and LabView. Python functions can read and parse these files line by line. NumPy has two standard routines for importing a file with delimited data: numpy.loadtxt and numpy.genfromtxt. The example below reads a simple.csv that has one header row (skipped with skiprows = 1) followed by two comma-separated columns:

>>> np.loadtxt('simple.csv', delimiter = ',', skiprows = 1)
array([[0., 0.],
       [1., 1.],
       [2., 4.],
       [3., 9.]])
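
If you don't have a simple.csv at hand, you can try both routines on an in-memory file. In the sketch below, the csv_text string is an assumed stand-in for the file's contents (a header row plus the four data rows shown in the output above), and io.StringIO makes it behave like an open file.

```python
import io
import numpy as np

# Assumed stand-in for simple.csv: one header row, then two comma-separated columns.
csv_text = "x,y\n0,0\n1,1\n2,4\n3,9\n"

# loadtxt: skip the header row with skiprows.
data = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)
print(data.shape)  # (4, 2)

# genfromtxt: skip_header plays the same role as skiprows.
data2 = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)
print(np.array_equal(data, data2))
```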

Array Indexing:



References:


  1. Nathan Cheever, PyGotham 2019—1000x faster data manipulation: vectorizing with Pandas and Numpy

  2. 3blue1brown — Vectors, what even are they? | Essence of linear algebra, chapter 1

  3. Code Mechanic: Numpy Vectorization - https://chelseatroy.com/2018/11/07/code-mechanic-numpy-vectorization/











