Pandas #

From http://pandas.pydata.org/pandas-docs/stable/

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.

Standard way to load the module #

import pandas as pd

Introduction #

We already saw how NumPy arrays can improve the analysis of numerical data. For heterogeneous data, the recommended tool is the Pandas DataFrame.

Heterogeneous and nested data can be stored as a list of dictionaries. For example, for people with a name, birth date, sex, and a list of jobs with start and end dates, we can have

# Create a dictionary for each person's data
person1 = {"Name": "John Doe", "Birth Date": "01-01-1980", "Sex": "Male", 
           "Job": [{"Job Title": "Software Engineer", "Start Date": "01-01-2000", "End Date": "01-01-2005"}, 
                   {"Job Title": "Data Scientist", "Start Date": "01-01-2005", "End Date": None}]}
person2 = {"Name": "Jane Doe", "Birth Date": "01-01-1985", "Sex": "Female", 
           "Job": [{"Job Title": "Product Manager", "Start Date": "01-01-2010", "End Date": "01-01-2015"}, 
                   {"Job Title": "Project Manager", "Start Date": "01-01-2015", "End Date": "01-01-2020"}]}
person3 = {"Name": "Jim Smith", "Birth Date": "01-01-1990", "Sex": "Male", 
           "Job": [{"Job Title": "Data Analyst", "Start Date": "01-01-2010", "End Date": "01-01-2015"}, 
                   {"Job Title": "Business Analyst", "Start Date": "01-01-2015", "End Date": "01-01-2020"}]}
person4 = {"Name": "Sara Johnson", "Birth Date": "01-01-1995", "Sex": "Female", 
           "Job": [{"Job Title": "Product Designer", "Start Date": "01-01-2015", "End Date": "01-01-2020"}, 
                   {"Job Title": "UX Designer", "Start Date": "01-01-2020", "End Date": None}]}

# Create a list of dictionaries
people = [person1, person2, person3, person4]

We can create a DataFrame from the list of dictionaries:

df = pd.DataFrame(people)
df

	Name	Birth Date	Sex	Job
0	John Doe	01-01-1980	Male	[{'Job Title': 'Software Engineer', 'Start Dat...
1	Jane Doe	01-01-1985	Female	[{'Job Title': 'Product Manager', 'Start Date'...
2	Jim Smith	01-01-1990	Male	[{'Job Title': 'Data Analyst', 'Start Date': '...
3	Sara Johnson	01-01-1995	Female	[{'Job Title': 'Product Designer', 'Start Date...

As with NumPy, we can create masks to select specific rows of the DataFrame. For example, to keep only the rows of the women we use the syntax:

df[df["Sex"] == "Female"]

	Name	Birth Date	Sex	Job
1	Jane Doe	01-01-1985	Female	[{'Job Title': 'Product Manager', 'Start Date'...
3	Sara Johnson	01-01-1995	Female	[{'Job Title': 'Product Designer', 'Start Date...

To extract the last job of each person we can use the following code (.get is a safer way to obtain the value of a dictionary key, because it returns None instead of raising an error when the key is missing):

df['Last job']=df["Job"].apply(lambda L: L[-1].get('Job Title'))
df[['Name','Birth Date','Sex','Last job']]

	Name	Birth Date	Sex	Last job
0	John Doe	01-01-1980	Male	Data Scientist
1	Jane Doe	01-01-1985	Female	Project Manager
2	Jim Smith	01-01-1990	Male	Business Analyst
3	Sara Johnson	01-01-1995	Female	UX Designer
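The lambda above assumes that every job list has at least one entry. A minimal sketch of a more defensive variant follows; the helper name `last_job_title` and the two sample records are hypothetical and only serve as illustration.

```python
import pandas as pd

def last_job_title(jobs):
    """Return the title of the last job in the list, or None if there is none."""
    if not jobs:                       # covers both an empty list and None
        return None
    return jobs[-1].get("Job Title")   # .get gives None if the key is missing

# Illustrative data only (not the people defined above)
sample = pd.DataFrame([
    {"Name": "A", "Job": [{"Job Title": "Analyst"}, {"Job Title": "Manager"}]},
    {"Name": "B", "Job": []},
])
sample["Last job"] = sample["Job"].apply(last_job_title)
```

With a named function the guard for empty lists stays readable, which is harder to achieve inside a one-line lambda.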

Basic structure: DataFrame #

A flat spreadsheet can be seen, in terms of Python types, as a dictionary of lists, where each column of the spreadsheet is a key-list pair of the dictionary:

	A	B
1	even	odd
2	0	1
3	2	3
4	4	5
5	6	7
6	8	9

numbers={"even": [0,2,4,6,8],   #  First  key-list
         "odd" : [1,3,5,7,9] }  #  Second key-list

Data structures #

Pandas has two new data structures:

DataFrame, which is similar to a two-dimensional NumPy array but with a key assigned to each column. For example, the values of the previous case correspond to the array

import numpy as np
np.array([[0,1],
          [2,3],
          [4,5],
          [6,7],
          [8,9] 
         ])

Series, which are enriched dictionaries, like the ones defined for the rows of the previous example: {'even':0,'odd':1}.

The rows in a two-dimensional DataFrame correspond to Series with the same keys, while the columns are also Series with the indices as keys.

An example of a DataFrame is a spreadsheet, as the one before.
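The relation between the two structures can be checked directly. The sketch below rebuilds the even/odd table under the assumed name `df_numbers` and converts one row and one column back to plain dictionaries.

```python
import pandas as pd

df_numbers = pd.DataFrame({"even": [0, 2, 4, 6, 8], "odd": [1, 3, 5, 7, 9]})

row = df_numbers.loc[0]        # a row: Series whose keys are the column names
column = df_numbers["even"]    # a column: Series whose keys are the index labels

row_as_dict = row.to_dict()        # keys: 'even', 'odd'
column_as_dict = column.to_dict()  # keys: 0, 1, 2, 3, 4
```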

`DataFrame`#

Pandas can convert a dictionary of lists, like the numbers dictionary before, into a DataFrame, which is just a spreadsheet but interpreted at the programming level:

numbers

{'even': [0, 2, 4, 6, 8], 'odd': [1, 3, 5, 7, 9]}

import pandas as pd
df=pd.DataFrame(numbers)
df

	even	odd
0	0	1
1	2	3
2	4	5
3	6	7
4	8	9

import matplotlib.pyplot as plt

plt.plot(df['even'],df['odd'])

[<matplotlib.lines.Line2D at 0x7f7d7a779880>]

See below for other possibilities of creating Pandas DataFrames from lists and dictionaries

The main advantage of the DataFrame, df, over a spreadsheet is that it can be managed entirely at the programming level, without any graphical interface.

We can check the shape of the DataFrame

df.shape

(5, 2)

Export DataFrame to other formats#

To export to Excel:

df.to_excel('example.xlsx',index=False)

newdf=pd.read_excel('example.xlsx')
newdf

	even	odd
0	0	1
1	2	3
2	4	5
3	6	7
4	8	9
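The effect of index=False can also be seen without an Excel engine installed. The following sketch does the same round trip through CSV text held in memory; `df_numbers` is an assumed name for the even/odd table.

```python
import io
import pandas as pd

df_numbers = pd.DataFrame({"even": [0, 2, 4, 6, 8], "odd": [1, 3, 5, 7, 9]})

# With no path given, to_csv returns the CSV text as a string
csv_text = df_numbers.to_csv(index=False)
restored = pd.read_csv(io.StringIO(csv_text))

# Without index=False the index is written too and comes back as an extra column
with_index = pd.read_csv(io.StringIO(df_numbers.to_csv()))
extra_columns = list(with_index.columns)
```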

newdf['fractions']=[0.5,2.5,4.5,6.5,8.5]
newdf

	even	odd	fractions
0	0	1	0.5
1	2	3	2.5
2	4	5	4.5
3	6	7	6.5
4	8	9	8.5

newdf['next fractions']=1.5
newdf

	even	odd	fractions	next fractions
0	0	1	0.5	1.5
1	2	3	2.5	1.5
2	4	5	4.5	1.5
3	6	7	6.5	1.5
4	8	9	8.5	1.5

newdf.loc[3,'next to next fractions']=1.7
newdf

	even	odd	fractions	next fractions	next to next fractions
0	0	1	0.5	1.5	NaN
1	2	3	2.5	1.5	NaN
2	4	5	4.5	1.5	NaN
3	6	7	6.5	1.5	1.7
4	8	9	8.5	1.5	NaN
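The two kinds of assignment used above behave differently: a scalar assigned to a whole column is repeated in every row, while a scalar assigned to a single cell of a new column leaves the other rows as NaN. A small sketch with the hypothetical column names `constant` and `extra`:

```python
import pandas as pd

demo = pd.DataFrame({"even": [0, 2, 4], "odd": [1, 3, 5]})

demo["constant"] = 1.5       # scalar broadcast to every row
demo.loc[1, "extra"] = 1.7   # new column; only the row with label 1 gets a value

missing = demo["extra"].isna().tolist()
```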

Activity: Open the resulting spreadsheet in Google Drive, publish it and open from the resulting link with Pandas in the next cell

df=pd.read_excel('PASTE THE PUBLISHED LINK HERE')
df

df=pd.read_excel('https://docs.google.com/spreadsheets/d/e/2PACX-1vQ1HFwErJcHkkOCT4Je-yuLSRe2L_GKWcCGVooc6rbOvTLxJhqglTZh31I_eB_dcw/pub?output=xlsx')
df

	even	odd
0	0.0	1.0
1	2.0	3.0
2	4.0	5.0
3	6.0	7.0
4	8.0	9.0

`Series`#

Each column of the DataFrame is now an augmented dictionary called Series, with the indices as the keys of the Series.

A column is selected with its key:

df['even']

0    0
1    2
2    4
3    6
4    8
Name: even, dtype: int64

type( df['even'] )

pandas.core.series.Series

df.even

0    0
1    2
2    4
3    6
4    8
Name: even, dtype: int64

The keys of the Series are the index labels of the DataFrame:

#df['even']
df.even[4]

Each row is also a Series:

df.loc[0]

even    0
odd     1
Name: 0, dtype: int64

with keys 'even' and 'odd'. With a double bracket the row is returned as a one-row DataFrame, that is, as a filter:

df.loc[[4]]

	even	odd
4	8	9

df.loc[0]['even']

The values of a row can be reached through its keys, as above, or through the attributes even and odd:

df.loc[0].odd

One specific cell value can be reached with the index and the key:

df.iloc[2,1]

df.loc[2,'odd']

df.at[2,'even']
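The three accessors differ in what the first argument means: .iloc uses positions, while .loc and .at use labels. The sketch below, with the assumed name `df_numbers`, shows that the distinction only becomes visible once the index is no longer 0, 1, 2, ... in order.

```python
import pandas as pd

df_numbers = pd.DataFrame({"even": [0, 2, 4, 6, 8], "odd": [1, 3, 5, 7, 9]})

by_position = df_numbers.iloc[2, 1]      # third row, second column (positions)
by_label = df_numbers.loc[2, "odd"]      # index label 2, column 'odd'
single_cell = df_numbers.at[2, "even"]   # label-based access to one scalar

# After sorting, the index labels travel with their rows
shuffled = df_numbers.sort_values("even", ascending=False)
first_by_position = shuffled.iloc[0]["even"]   # first row as displayed
first_by_label = shuffled.loc[0, "even"]       # row whose label is 0
```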

A Pandas Series object can be just initialized from a Python dictionary:

s=pd.Series({'Name':'Juan Valdez','Nacionality':'Colombia','Age':23})
s

Name           Juan Valdez
Nacionality       Colombia
Age                     23
dtype: object

s['Name']

'Juan Valdez'

A Series also works as a namespace container: the keys are available as attributes!

s.Name

'Juan Valdez'
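Attribute access is convenient, but it only works when the key does not collide with an existing attribute or method of the Series. A short sketch with a made-up key, `size`, shows the difference; bracket access is the safe choice in general.

```python
import pandas as pd

s_demo = pd.Series({"Name": "Juan Valdez", "size": 10})

by_key = s_demo["size"]       # the stored value
by_attribute = s_demo.size    # the built-in attribute: number of elements
name_value = s_demo.Name      # no collision, so the key is found
```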

The power of Pandas relies on the fact that its main data structures, DataFrames and Series, are enriched with many useful methods and attributes.

In the official definition of Pandas quoted at the beginning:

“relational”: the list of data is identified with some unique index (like a SQL table)
“labeled”: the list is identified with a key, like the previous odd or even keys.

For example, a double bracket [[...]] can be used to filter data.

A row in a two-dimensional DataFrame corresponds to a Series with the same keys as the DataFrame, but with single values instead of lists:

df.loc[[0]]

	even	odd
0	0	1

To filter a column:

df[['odd']]

	odd
0	1
1	3
2	5
3	7
4	9
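The type of the result depends on the number of brackets. As a quick check, assuming the even/odd table under the name `df_numbers`:

```python
import pandas as pd

df_numbers = pd.DataFrame({"even": [0, 2, 4, 6, 8], "odd": [1, 3, 5, 7, 9]})

as_series = df_numbers["odd"]              # single bracket: a Series
as_frame = df_numbers[["odd"]]             # double bracket: a one-column DataFrame
reordered = df_numbers[["odd", "even"]]    # a list of keys selects and orders columns
```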

`DataFrame` initialization #

Initialization from an existing spreadsheet.#

The spreadsheet can be stored locally on your computer or be available from some downloadable link

df=pd.read_excel('http://bit.ly/spreadsheet_xlsx')
df

	Nombre	Edad	Compañia
0	Juan Valdez	23.0	Café de Colombia
1	Álvaro Uribe Vélez	65.0	Senado de la República

To make a downloadable link for any spreadsheet in Google Drive, follow the sequence:

File → Share → Publish to the web...→ Entire Document → Web page → Microsoft excel (xlsx)

df.loc[0,'Edad']=32
#df.at[0,'Edad']=32
df

	Nombre	Edad	Compañia
0	Juan Valdez	32.0	Café de Colombia
1	Álvaro Uribe Vélez	65.0	Senado de la República

After some modification, it can be saved again as an Excel file with the option index=False, so that no column of indices is created.

Initialization from lists and dictionaries#

Creating a Pandas DataFrame from lists and dictionaries offers many alternatives.

[Figure: creating dataframes]

Column oriented way#

The first alternative is the dictionary of lists already illustrated at the beginning, which in this case corresponds to:

pd.DataFrame({'Nombre'   : ['Juan Valdez','Álvaro Uribe Vélez'],
              'Edad'     : [32,            69                 ],
              'Compañia' : ['Café de Colombia','Senado de la República']})

	Nombre	Edad	Compañia
0	Juan Valdez	32	Café de Colombia
1	Álvaro Uribe Vélez	69	Senado de la República

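We can obtain the DataFrame from a list of items (key-list pairs):

# DataFrame.from_items was removed in pandas 1.0; a dictionary built from the items is equivalent
pd.DataFrame(dict([ [ 'Nombre'  , ['Juan Valdez','Álvaro Uribe Vélez']],
                    [ 'Edad'    , [  32,            65               ]],
                    [ 'Compañia', ['Café de Colombia','Senado de la República']] ]))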

We can obtain the DataFrame from a list of dictionaries, one per row:

pd.DataFrame( [{'Nombre':'Juan Valdez',        'Edad': 32   ,'Compañia':'Café de Colombia'      },
              {'Nombre':'Álvaro Uribe Vélez', 'Edad': 65   ,'Compañia':'Senado de la República'}]
            )

	Nombre	Edad	Compañia
0	Juan Valdez	32	Café de Colombia
1	Álvaro Uribe Vélez	65	Senado de la República
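The column-oriented and row-oriented forms carry the same information, and Pandas can convert between them. The sketch below uses the assumed names `by_columns` and `by_rows` to make the round trip explicit.

```python
import pandas as pd

# Column oriented: dictionary of lists
by_columns = pd.DataFrame({
    "Nombre": ["Juan Valdez", "Álvaro Uribe Vélez"],
    "Edad": [32, 65],
    "Compañia": ["Café de Colombia", "Senado de la República"],
})

# Row oriented: list of dictionaries, one per row
records = by_columns.to_dict(orient="records")
by_rows = pd.DataFrame(records)

# And back to a dictionary of lists
as_lists = by_columns.to_dict(orient="list")
```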

Special DataFrames#

Empty DataFrame#

df=pd.DataFrame()
df

df.empty

True

Single row DataFrame from dictionary#

d={
    'first key' :'first value',
    'second key':'second value'

  }
pd.DataFrame([d])

	first key	second key
0	first value	second value
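The list around the dictionary matters: a dictionary whose values are all scalars has no information about the number of rows, so passing it directly fails. A short sketch (variable names are illustrative) shows the error and two ways around it.

```python
import pandas as pd

d_row = {"first key": "first value", "second key": "second value"}

try:
    pd.DataFrame(d_row)    # all values are scalars and no index is given
    raised = False
except ValueError:
    raised = True

single_row = pd.DataFrame([d_row])           # list with one dictionary: one row
same_row = pd.DataFrame(d_row, index=[0])    # alternative: give the index explicitly
```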

Initialization from sequential rows as Series#

We start with an empty DataFrame:

import pandas as pd
df=pd.DataFrame()
df.empty

True

We can append a dictionary (or Series) as a row of the DataFrame with pd.concat, provided that we always use the option ignore_index=True, so that the rows are renumbered consecutively:

d={'Name':'Juan Valdez','Nacionality':'Colombia','Age':23}
df=pd.concat([df,pd.DataFrame([d])],ignore_index=True)
df

	Name	Nacionality	Age
0	Juan Valdez	Colombia	23

To add a second row we build another dictionary

d={}
for k in ['Name','Nacionality','Age','Company']:
    var=input('{}:\n'.format(k))
    d[k]=var

Name:
 Diego Restrepo
Nacionality:
 Colombia
Age:
 51
Company:
 UdeA

{'Name': 'Diego Restrepo',
 'Nacionality': 'Colombia',
 'Age': '51',
 'Company': 'UdeA'}

df=pd.concat([df,pd.DataFrame([d])],ignore_index=True)

To concatenate a list of dataframes side by side use the option axis='columns'
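A minimal sketch of both directions of pd.concat, using two small made-up frames named `left` and `right`:

```python
import pandas as pd

left = pd.DataFrame({"Name": ["Juan Valdez", "Diego Restrepo"]})
right = pd.DataFrame({"Age": [23, 51]})

# Side by side: rows are matched through their index labels
side_by_side = pd.concat([left, right], axis="columns")

# One below the other: ignore_index=True renumbers the rows
stacked = pd.concat([left, left], ignore_index=True)
```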

Exercises#

Display the resulting Series on the screen:

df['Name']

0       Juan Valdez
1    Diego Restrepo
Name: Name, dtype: object

Activity: Append a new row to the previous DataFrame and visualize it:

Fill NaN with empty strings

df=df.fillna('')

df

	Name	Nacionality	Age	Company
0	Juan Valdez	Colombia	23
1	Diego Restrepo	Colombia	51	UdeA

Save Pandas DataFrame as an Excel file

df.to_excel('prof.xlsx',index=False)

Load pandas DataFrame from the saved file in Excel

pd.read_excel('prof.xlsx')

	Name	Nacionality	Age	Company
0	Juan Valdez	Colombia	23	NaN
1	Diego Restrepo	Colombia	51	UdeA

	Name	Age
0	Donald Trump	74
1	Barak Obama	59

	Name	Age
0	Donald Trump	73
1	Barak Obama	58

	Name	Age
0	Donald Trump, Junior	73
1	Barak Obama, Senior	58

	Name	Age	name
0	Donald Trump, Junior	73	Donald Trump
1	Barak Obama, Senior	58	Barak Obama

	Name	Age
0	Donald Trump Junior	73
1	Barak Obama Senior	58

	Name	Age	name
0	Donald Trump Junior	73	Donald Trump
1	Barak Obama Senior	58	Barak Obama

Common operations upon `DataFrames`#

See https://github.com/restrepo/PythonTipsAndTricks

To fill a specific cell

df.at[0,'Company']='Federación de Caferos'

df

	Name	Nacionality	Age	Company
0	Juan Valdez	Colombia	23	Federación de Caferos
1	Diego Restrepo	Colombia	51	UdeA

Filters (masking)#

The main application of labeled data in data analysis is the possibility to make filters, or cuts, to obtain specific reduced datasets for further analysis

import pandas as pd

numbers={"even": [0,2,4,-6,8],   #  First  key-list
         "odd" : [1,3,-5,7,9] }  #  Second key-list

df=pd.DataFrame(numbers)

df

	even	odd
0	0	1
1	2	3
2	4	-5
3	-6	7
4	8	9

A mask is a Series of True/False values

df.even.abs()>4

0    False
1    False
2    False
3     True
4     True
Name: even, dtype: bool

df[df.even.abs()>4]

	even	odd
3	-6	7
4	8	9

and → &

df[(df.even>0) & (df.odd<0)]

	even	odd
2	4	-5

negation → ~

df[~((df.even>0) & (df.odd<0)) ]

	even	odd
0	0	1
1	2	3
3	-6	7
4	8	9

or → |

df[(df.even<0) | (df.odd<0)]

	even	odd
2	4	-5
3	-6	7
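Masks are ordinary Series, so they can be stored, counted, and combined with a column selection. The sketch below also shows why the symbols &, | and ~ are needed: the Python keyword `and` does not work element by element. The name `df_signed` is an assumption for the table above.

```python
import pandas as pd

df_signed = pd.DataFrame({"even": [0, 2, 4, -6, 8], "odd": [1, 3, -5, 7, 9]})

mask = (df_signed["even"] > 0) & (df_signed["odd"] < 0)
n_matches = int(mask.sum())    # True counts as 1

# .loc accepts a mask for the rows and a key for the column
odd_where_even_negative = df_signed.loc[df_signed["even"] < 0, "odd"].tolist()

try:
    (df_signed["even"] > 0) and (df_signed["odd"] < 0)
    python_and_works = True
except ValueError:             # the truth value of a whole Series is ambiguous
    python_and_works = False
```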

The `apply` method #

The advantage of the spreadsheet paradigm is that the columns can be transformed with functions. The typical functions available in a spreadsheet are already implemented, like the method .abs() used before, or the method .sum():

df.even.sum()

Activity: Explore the available methods by using the completion system of the notebook after the last dot of df.even.

kk=df['even']

kk.

Column-level `apply`#

We just select the column and apply a pre-defined or an implicit (lambda) function:

Pre-defined function

df.even.apply(abs)

0    0
1    2
2    4
3    6
4    8
Name: even, dtype: int64

Implicit (lambda) function

df.even.apply(lambda n:isinstance(n,int))

0    True
1    True
2    True
3    True
4    True
Name: even, dtype: bool

df.even.apply(lambda n: n**2)

0     0
1     4
2    16
3    36
4    64
Name: even, dtype: int64

Row-level apply#

The full row is passed as a Series, which behaves like a dictionary, to the explicit or implicit function when apply is used on the full DataFrame with the option axis=1 (or, equivalently, axis='columns'):

df

	even	odd
0	0	1
1	2	3
2	4	-5
3	-6	7
4	8	9

df['even']+df['odd']**2

0     1
1    11
2    29
3    43
4    89
dtype: int64

df.apply(lambda row: row['even']+row['odd']**2,axis='columns')

0     1
1    11
2    29
3    43
4    89
dtype: int64

df.apply(lambda row: row.get('even')+row.get('odd')**2,axis='columns')

0     1
1    11
2    29
3    43
4    89
dtype: int64
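When the row logic grows beyond one expression, a named function is easier to read and to test than a lambda. The following sketch, with the hypothetical function name `even_plus_odd_squared`, checks that the row-level apply agrees with the direct column arithmetic.

```python
import pandas as pd

df_signed = pd.DataFrame({"even": [0, 2, 4, -6, 8], "odd": [1, 3, -5, 7, 9]})

def even_plus_odd_squared(row):
    """Receive one row (a Series) and return a single value."""
    return row["even"] + row["odd"] ** 2

row_wise = df_signed.apply(even_plus_odd_squared, axis="columns")
vectorized = df_signed["even"] + df_signed["odd"] ** 2
```

For simple arithmetic like this the column expression is usually preferable; the row-level apply is useful when each row needs logic that cannot be written column by column.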

Chain tools for data analysis #

There are several tool chains for data analysis, such as:

Spreadsheet-based ones, like Excel
Relational databases, which use SQL on tabular data with database software like MySQL
Non-relational databases (RAM) with Pandas, R, Paw,… ( max ~ RAM/8)
Non-relational databases (Disk): Dask, ROOT, MongoDB,…

Here we illustrate an example of the use of a non-relational database with Pandas

Relational databases #

import pandas as pd

personas=pd.read_csv('https://raw.githubusercontent.com/restrepo/ComputationalMethods/master/data/personas.csv')
personas

	Nombre	Fecha de Nacimiento	id
0	Juan Valdez	1966-07-04	888
1	Álvaro Uribe Vélez	1952-07-04	666

trabajos=pd.read_csv('https://raw.githubusercontent.com/restrepo/ComputationalMethods/master/data/trabajos.csv',
                     na_filter=False)
trabajos

	id	Inicio	Fin	Cargo	Compañía
0	888	2010		Arriero	Café de Colombia
1	666	2013	2020	Senador	Senado de la República de Colombia
2	666	2020		Influencer	Twitter

Example#

Obtain the current work of Álvaro Uribe Vélez

trabajos

	id	Inicio	Fin	Cargo	Compañía
0	888	2010		Arriero	Café de Colombia
1	666	2013	2020	Senador	Senado de la República de Colombia
2	666	2020		Influencer	Twitter

It is convenient to normalize the columns with strings before trying to search inside them with a method like .str.contains

import unidecode

unidecode.unidecode('Álvaro de Uribe').lower()

'alvaro de uribe'
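If the unidecode package is not installed, a similar normalization for accented Latin characters can be sketched with the string methods of Pandas alone. This is an alternative technique, not the one used in this page, and it simply drops any character that has no ASCII equivalent after decomposition.

```python
import pandas as pd

names_demo = pd.Series(["Álvaro Uribe Vélez", "Juan Valdez"])

normalized = (names_demo
              .str.normalize("NFKD")                  # split letters from their accents
              .str.encode("ascii", errors="ignore")   # drop the non-ASCII accents
              .str.decode("ascii")                    # back from bytes to text
              .str.lower())

found = normalized.str.contains("alvaro uribe velez").tolist()
```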

cc=personas[personas['Nombre'].str.lower().apply(
    unidecode.unidecode).str.contains('alvaro uribe velez')].iloc[0].get('id')

trabajos[trabajos.get('id')==cc]['Cargo'].to_list()

['Senador', 'Influencer']
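The lookup through the shared id column is what a relational join does. A sketch with merge follows; the two small frames are typed by hand from the tables shown above (birth dates and companies omitted), and the names ending in `_demo` are assumptions.

```python
import pandas as pd

personas_demo = pd.DataFrame({
    "Nombre": ["Juan Valdez", "Álvaro Uribe Vélez"],
    "id": [888, 666],
})
trabajos_demo = pd.DataFrame({
    "id": [888, 666, 666],
    "Inicio": [2010, 2013, 2020],
    "Fin": ["", "2020", ""],
    "Cargo": ["Arriero", "Senador", "Influencer"],
})

# One row per (person, job) pair, matched on the common column 'id'
joined = personas_demo.merge(trabajos_demo, on="id", how="left")

is_uribe = joined["Nombre"] == "Álvaro Uribe Vélez"
cargos = joined.loc[is_uribe, "Cargo"].to_list()
current = joined.loc[is_uribe & (joined["Fin"] == ""), "Cargo"].to_list()  # no end date
```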

Non-relational databases #

Nested lists of dictionaries with a defined data schema

personas['Fecha de Nacimiento']=pd.to_datetime( personas['Fecha de Nacimiento'] )

personas

	Nombre	Fecha de Nacimiento	id
0	Juan Valdez	1966-07-04	888
1	Álvaro Uribe Vélez	1952-07-04	666

Extract-Transform-Load: ETL

from dateutil.relativedelta import relativedelta

personas['Edad']=personas['Fecha de Nacimiento'].apply(lambda t: 
                        relativedelta( pd.Timestamp.now(), t).years )

trabajos[trabajos['id']==666].to_dict(orient='records')

[{'id': 666,
  'Inicio': 2013,
  'Fin': '2020',
  'Cargo': 'Senador',
  'Compañía': 'Senado de la República de Colombia'},
 {'id': 666,
  'Inicio': 2020,
  'Fin': '',
  'Cargo': 'Influencer',
  'Compañía': 'Twitter'}]

personas

	Nombre	Fecha de Nacimiento	id	Edad
0	Juan Valdez	1966-07-04	888	55
1	Álvaro Uribe Vélez	1952-07-04	666	69
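The ages above depend on the day the cell is executed. To get a reproducible result, the reference date can be fixed. The sketch below uses a hypothetical helper `age_on` and an assumed reference date of 2022-01-01, and counts completed years without dateutil.

```python
import pandas as pd

def age_on(birth, reference):
    """Completed years between birth and reference (both timestamps)."""
    before_birthday = (reference.month, reference.day) < (birth.month, birth.day)
    return reference.year - birth.year - int(before_birthday)

births = pd.to_datetime(pd.Series(["1966-07-04", "1952-07-04"]))
reference = pd.Timestamp("2022-01-01")   # assumed date, chosen only for reproducibility

ages = births.apply(lambda t: age_on(t, reference))
```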

personas['id']

0    888
1    666
Name: id, dtype: int64

personas['Trabajos']=personas['id'].apply(lambda i:  trabajos[trabajos['id']==i
                                                             ][['Inicio','Fin','Cargo','Compañía']
                                                              ].to_dict(orient='records') )

personas

	Nombre	Fecha de Nacimiento	id	Edad	Trabajos
0	Juan Valdez	1966-07-04	888	55	[{'Inicio': 2010, 'Fin': '', 'Cargo': 'Arriero...
1	Álvaro Uribe Vélez	1952-07-04	666	69	[{'Inicio': 2013, 'Fin': '2020', 'Cargo': 'Sen...

personajes=personas[['Nombre','Edad','Trabajos']]

personajes

	Nombre	Edad	Trabajos
0	Juan Valdez	55	[{'Inicio': 2010, 'Fin': '', 'Cargo': 'Arriero...
1	Álvaro Uribe Vélez	69	[{'Inicio': 2013, 'Fin': '2020', 'Cargo': 'Sen...

personajes.to_dict(orient='records')

[{'Nombre': 'Juan Valdez',
  'Edad': 55,
  'Trabajos': [{'Inicio': 2010,
    'Fin': '',
    'Cargo': 'Arriero',
    'Compañía': 'Café de Colombia'}]},
 {'Nombre': 'Álvaro Uribe Vélez',
  'Edad': 69,
  'Trabajos': [{'Inicio': 2013,
    'Fin': '2020',
    'Cargo': 'Senador',
    'Compañía': 'Senado de la República de Colombia'},
   {'Inicio': 2020, 'Fin': '', 'Cargo': 'Influencer', 'Compañía': 'Twitter'}]}]
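The nested records can be flattened back into a table with one row per job. A sketch using pd.json_normalize follows; the list `nested` is typed by hand from the output above, with the company field omitted for brevity.

```python
import pandas as pd

nested = [
    {"Nombre": "Juan Valdez", "Edad": 55,
     "Trabajos": [{"Inicio": 2010, "Fin": "", "Cargo": "Arriero"}]},
    {"Nombre": "Álvaro Uribe Vélez", "Edad": 69,
     "Trabajos": [{"Inicio": 2013, "Fin": "2020", "Cargo": "Senador"},
                  {"Inicio": 2020, "Fin": "", "Cargo": "Influencer"}]},
]

# record_path: the nested list to expand; meta: outer keys repeated on each row
flat = pd.json_normalize(nested, record_path="Trabajos", meta=["Nombre", "Edad"])

open_jobs = flat.loc[flat["Fin"] == "", "Cargo"].tolist()   # jobs without end date
```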

from IPython.display import JSON

JSON( personajes.to_dict(orient='records') )

<IPython.core.display.JSON object>

Activity#

Obtain the last job of Álvaro Uribe Vélez

personajes[personajes['Nombre']=='Álvaro Uribe Vélez'
          ].get('Trabajos'
          ).apply(lambda l: [d.get('Cargo') for d in l if not d.get('Fin')]
          ).str[0].to_list()[0]

'Influencer'

We have shown that simple two-dimensional spreadsheets, where each cell value is a simple type like a string, integer, or float, can be represented as a dictionary of lists or as a list of dictionaries with column-value assignments.

We can go further and store in the cell value itself a more general data structure, like nested lists and dictionaries. This allows advanced data analysis when the apply method is used to operate inside the nested lists or dictionaries.

See for example:

World wide web #

There are really three kinds of web:

The normal web,
The deep web,
The machine web. The web for machine readable responses. It is served in JSON or XML formats, which preserve programming objects.

Normal web#

pd.read_html('https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory')[0][1:]

	COVID-19 pandemic	COVID-19 pandemic.1
1	Disease	COVID-19
2	Virus strain	SARS-CoV-2
3	Source	Probably bats, possibly via pangolins[1][2]
4	Location	Worldwide
5	First outbreak	Mainland China[3]
6	Index case	Wuhan, Hubei, China.mw-parser-output .geo-default,.mw-parser-output .geo-dms,.mw-parser-output .geo-dec{display:inline}.mw-parser-output .geo-nondefault,.mw-parser-output .geo-multi-punct{display:...
7	Date	1 December 2019[3] – present(1 year, 3 months, 3 weeks and 3 days)
8	Confirmed cases	124,971,776[4]
9	Active cases	51,331,455[4]
10	Recovered	70,893,740[4]
11	Deaths	2,746,581[4]
12	Territories	192[4]

Real world example: Microsoft Academic

Machine web#

For example, consider the following normal web page:

https://inspirehep.net/literature?q=doi:10.1103/PhysRevLett.122.132001

about a scientific paper with people from the University of Antioquia. A machine web version can be easily obtained in JSON just by inserting /api/ into the URL, as in the cell below, and directly loaded with Pandas, which works like a browser for the third web:

import pandas as pd

df=pd.read_json('https://inspirehep.net/api/literature?q=doi:10.1103/PhysRevLett.122.132001')

df

	hits	links
hits	[{'metadata': {'report_numbers': [{'value': 'C...	NaN
total	1	NaN
self	NaN	https://inspirehep.net/api/literature/?q=doi%3...
bibtex	NaN	https://inspirehep.net/api/literature/?q=doi%3...
latex-eu	NaN	https://inspirehep.net/api/literature/?q=doi%3...
latex-us	NaN	https://inspirehep.net/api/literature/?q=doi%3...
json	NaN	https://inspirehep.net/api/literature/?q=doi%3...

We can use all the previous methods to extract the authors from 'Antioquia U.':

Note: For a dictionary d, it is safer to use d.get('key') instead of just d['key'] to obtain the value of some key, because no error is generated if the requested key does not exist at all

df[df['hits'].apply(lambda l: isinstance(l,list))]['hits' # extract cell with list
            ].apply(lambda l: [d.get('metadata') for d in l] # metadata of article
            ).str[0 #get the matched article dictionary
            ].str['authors' # get list of authors → l
            ].apply(lambda l: [ f'{d.get("first_name")} {d.get("last_name")}' for d in l  #author is a dictionary → d
                               #d.get('affiliations') is a list  of dictionaries → dd                               
                               if 'Antioquia U.' in [dd.get('value') for dd in d.get('affiliations')] 
                              ])

hits    [Jhovanny Mejia Guisao, José David Ruiz Alvarez]
Name: hits, dtype: object

Authors=df[df['hits'].apply(lambda l: isinstance(l,list))]['hits' # extract cell with list
            ].apply(lambda articles: [article.get('metadata') for article in articles] # metadata of article
            ).str[0 #get the matched article dictionary
            ].str['authors' # get list of authors → l
            ]

names=Authors.apply(lambda authors: [ author.get('full_name') for author in authors  #author is a dictionary
                               #author.get('affiliations') is a list  of dictionaries → affiliation                              
                               if 'Antioquia U.' in [affiliation.get('value') for affiliation in author.get('affiliations')] 
                              ])
names[0]

['Mejia Guisao, Jhovanny', 'Ruiz Alvarez, José David']

We can see that the column authors is quite nested: it is a list of dictionaries with the full information for each one of the authors of the article.

Activity: Check that the length of the authors list coincides with the number_of_authors

For further details see: https://github.com/restrepo/inspire/blob/master/gfif.ipynb

Activity: Repeat the same activity but using directly the JSON file, obtained with requests

#See: https://github.com/inspirehep/rest-api-doc/issues/4#issuecomment-645218074
import requests
response = requests.get('https://inspirehep.net/api/doi/10.1103/PhysRevLett.122.132001')
response.raise_for_status()  # stop here if the server answered with an HTTP error
authors = response.json()['metadata']['authors']
# `or []` guards against authors that have no 'affiliations' entry
names = [author.get('full_name')
         for author in authors
         if any(aff.get('value') == 'Antioquia U.' for aff in (author.get('affiliations') or []))]
names

['Mejia Guisao, Jhovanny', 'Ruiz Alvarez, José David']
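The same filtering logic can be tried without a network connection. The record below is hypothetical and hand-written: it only mimics the nesting of authors and affiliations described above, and the function name `authors_from` is an assumption.

```python
# Hypothetical record: same nesting as the response, invented values
record = {"metadata": {"authors": [
    {"full_name": "Author, One", "affiliations": [{"value": "Antioquia U."}]},
    {"full_name": "Author, Two", "affiliations": [{"value": "Other U."}]},
    {"full_name": "Author, Three"},            # no 'affiliations' key at all
]}}

def authors_from(record, institution):
    """Full names of the authors with an affiliation equal to institution."""
    authors = record.get("metadata", {}).get("authors", [])
    return [author.get("full_name")
            for author in authors
            if any(aff.get("value") == institution
                   for aff in (author.get("affiliations") or []))]

selected = authors_from(record, "Antioquia U.")
```

Using .get with a default at every level means that a missing key yields an empty result instead of an exception.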

Summary #

Pandas_Cheat_Sheet PDF

ACTIVITIES #

See:

https://github.com/ajcr/100-pandas-puzzles
https://github.com/guipsamora/pandas_exercises
https://rramosp.github.io/ai4eng.v1/content/NOTES%2002.04%20-%20PANDAS.html

Final remarks #

With basic scripting and Pandas we already have a solid environment to analyse data. The other libraries are introduced with the motivation of extending the capabilities of Pandas

Appendix #

Summary with ChatGPT
