Using Spark for Data Profiling or Exploratory Data Analysis

Data profiling is the process of examining the data available in an existing data source (e.g. a database or a file) and collecting statistics and information about that data. These statistics can, for example, reveal whether the existing data can easily be reused for other purposes.

Before any dataset is used for advanced data analytics, an exploratory data analysis (EDA) or data profiling step is necessary. Profiling is also well suited to datasets containing personal data, because only aggregated statistics are exposed. The Social-3 Personal Data Framework provides metadata and data profiling information for each available dataset; one of the earliest steps after data ingestion is the automated creation of a data profile.

Exploratory data analysis (EDA) or data profiling helps assess which data might be useful and reveals the as yet unknown characteristics of a new dataset, including its data quality and the data transformation requirements, before data analytics can be applied.

Data consumers can browse and gain insight into the available datasets in the data lake of the Social-3 Personal Data Framework and can make informed decisions on their usage and privacy requirements. The Social-3 Personal Data Framework contains a data catalogue that allows data consumers to select interesting datasets and put them in a “shopping basket” to indicate which datasets they want to use and how they want to use them.

Before using a dataset with any algorithm, it is essential to understand what the data looks like, what the edge cases are, and how each attribute is distributed. The questions that need to be answered concern the distribution of the attributes (the columns of the table), their completeness, and the amount of missing data.

The findings of EDA can be translated into constraints or rules that are then enforced in a subsequent cleansing step. For instance, after discovering that the most frequent pattern for phone numbers is (ddd)ddd-dddd, this pattern can be promoted to a rule that all phone numbers must be formatted accordingly. Most cleansing tools can then either transform differently formatted numbers or at least mark them as violations.
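
As an illustration, such a rule could be checked in Spark roughly as follows; the contacts DataFrame, its phone column and the sample values are made up for this sketch and are not part of any actual cleansing tool.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("phone-pattern-rule").getOrCreate()

    # Made-up sample; the column name and values are purely illustrative.
    contacts = spark.createDataFrame(
        [("(123)456-7890",), ("123-456-7890",), ("(555)000-1111",)], ["phone"]
    )

    # Promote the discovered pattern (ddd)ddd-dddd to a validation rule.
    pattern = r"^\(\d{3}\)\d{3}-\d{4}$"

    # Flag violations instead of silently dropping them, as a cleansing tool would.
    checked = contacts.withColumn("phone_ok", F.col("phone").rlike(pattern))
    checked.filter(~F.col("phone_ok")).show()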

Most EDA produces summary statistics for each attribute independently, although some analyses are based on pairs or combinations of attributes. Data profiling should address the following topics (a sketch after the list illustrates a few of them in Spark):

  • Completeness: How complete is the data? What percentage of records has missing or null values?
  • Uniqueness: How many unique values does an attribute have? Does an attribute that is supposed to be a unique key contain only unique values?
  • Distribution: What is the distribution of values of an attribute?
  • Basic statistics: The mean, standard deviation, minimum, maximum for numerical attributes.
  • Pattern matching: What patterns are matched by data values of an attribute?
  • Outliers: Are there outliers in the numerical data?
  • Correlation: What is the correlation between two given attributes? This kind of profiling may be important for feature analysis prior to building predictive models.
  • Functional dependency: Is there functional dependency between two attributes?
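
The sketch below illustrates a few of these topics in PySpark; the DataFrame and the column names id, age, income and email are made up for the example and do not come from an actual Social-3 dataset.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("profiling-topics").getOrCreate()

    # Hypothetical sample rows; a real profile would run on a full dataset in the data lake.
    df = spark.createDataFrame(
        [(1, 34, 52000.0, "a@example.org"),
         (2, 41, 48000.0, None),
         (3, 29, 61000.0, "c@example.org")],
        ["id", "age", "income", "email"],
    )
    total = df.count()

    # Completeness: number of missing (null) values per column.
    df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).show()

    # Uniqueness: does the supposed key column contain only unique values?
    print(df.select("id").distinct().count() == total)

    # Basic statistics and distribution of a numeric attribute.
    df.select(F.mean("age"), F.stddev("age"), F.min("age"), F.max("age")).show()

    # Correlation between two numeric attributes (Pearson).
    print(df.stat.corr("age", "income"))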

The advantages of EDA can be summarized as:

  • Find out what is in the data before using it
  • Get data quality metrics
  • Get an early assessment of the difficulties in creating business rules
  • Provide input to a subsequent cleansing step
  • Discover value patterns and distributions
  • Understand data challenges early to avoid delays and cost overruns
  • Improve the ability to search the data

Data volumes can be so large that traditional EDA or data profiling, for example a Python script computing descriptive statistics, becomes intractable. Even with scalable infrastructure such as Hadoop, aggressive optimization and statistical approximation techniques must sometimes be used.

Spark, however, might provide enough capability to compute summary statistics for data profiling or EDA on very large datasets.
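
For example, Spark ships with approximate aggregates that trade a small, bounded error for scalability; the dataset path and column names in this sketch are illustrative.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("approx-profiling").getOrCreate()

    # Illustrative read of a large dataset from the data lake; path and columns are made up.
    df = spark.read.parquet("/datalake/events")

    # Approximate distinct count: much cheaper than an exact countDistinct on very
    # large data, at the cost of a bounded relative standard deviation (here 5%).
    df.select(F.approx_count_distinct("user_id", rsd=0.05)).show()

    # Approximate quantiles (quartiles) with a 1% relative error.
    quartiles = df.approxQuantile("amount", [0.25, 0.5, 0.75], 0.01)
    print(quartiles)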

Exploratory data analysis or data profiling is typically performed using Python or R, but since Spark introduced DataFrames, it has become possible to do the exploratory data analysis step in Spark as well, especially for larger datasets.

A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.

The data are stored in RDDs (with a schema), which means you can also process DataFrames with the original RDD API, as well as with the algorithms and utilities in MLlib.
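
A minimal sketch of both construction routes, assuming an illustrative CSV path and a small hand-made RDD:

    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.appName("dataframe-sources").getOrCreate()

    # From a structured data file (the path and schema are illustrative).
    patients = spark.read.csv("/datalake/patients.csv", header=True, inferSchema=True)

    # From an existing RDD of Rows.
    rdd = spark.sparkContext.parallelize([Row(name="Ada", age=36), Row(name="Lin", age=29)])
    people = spark.createDataFrame(rdd)

    # A DataFrame is still backed by an RDD with a schema, so the RDD API stays available.
    print(people.rdd.map(lambda r: r.age).take(2))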

One of the useful functions of Spark DataFrames is the describe method. It returns summary statistics for numeric columns in the source DataFrame: the count, mean, standard deviation, minimum and maximum. It takes the names of one or more columns as arguments.
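
A minimal sketch of the describe call; the column names StartYear, StopYear and ICD9Code match the example discussed below, but the rows themselves are made up.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("describe-demo").getOrCreate()

    # Made-up rows standing in for the dataset of the original example.
    diagnoses = spark.createDataFrame(
        [(2001, 2005, "250.00"), (0, None, "401.9"), (1998, 2003, "V58.69")],
        ["StartYear", "StopYear", "ICD9Code"],
    )

    # Summary statistics (count, mean, stddev, min, max) for the selected columns.
    diagnoses.describe("StartYear", "StopYear", "ICD9Code").show()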



The resulting summary provides information about missing data (e.g. a StartYear of 0, or an empty StopYear) and about the type and range of the data. Also notice that numeric calculations are sometimes made on a non-numeric field such as the ICD9Code.



The most basic form of data profiling is the analysis of individual columns in a given table. Typically, the generated metadata comprises various counts, such as the number of values, the number of unique values, and the number of non-null values. The following statistics are calculated (a condensed sketch follows the table):

Statistic                      How it is calculated
Count                          Using the DataFrame describe method
Average                        Using the DataFrame describe method
Minimum                        Using the DataFrame describe method
Maximum                        Using the DataFrame describe method
Standard deviation             Using the DataFrame describe method
Missing values                 Using the DataFrame filter method
Density                        Ratio calculation
Min. string length             Using the DataFrame expr, groupBy, agg, min, max, avg methods
Max. string length             Using the DataFrame expr, groupBy, agg, min, max, avg methods
# unique values                Using the DataFrame distinct and count methods
Top 100 most frequent values   Using the DataFrame groupBy, count, filter, orderBy, limit methods
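
A condensed sketch of how these per-column statistics could be computed for a single string column; the DataFrame and the city column are illustrative, and the actual profiling code of the framework may differ.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("column-profile").getOrCreate()

    # Illustrative single-column dataset; a real profile loops over all columns.
    df = spark.createDataFrame([("Gent",), ("Brussel",), (None,), ("Gent",)], ["city"])
    total = df.count()

    # Missing values (filter) and density (ratio of filled-in values to total).
    missing = df.filter(F.col("city").isNull()).count()
    density = (total - missing) / total

    # Min., max. and average string length (expr with agg).
    df.agg(F.min(F.expr("length(city)")),
           F.max(F.expr("length(city)")),
           F.avg(F.expr("length(city)"))).show()

    # Number of unique values (distinct, count).
    uniques = df.select("city").distinct().count()

    # Top 100 most frequent values (groupBy, count, orderBy, limit).
    df.groupBy("city").count().orderBy(F.desc("count")).limit(100).show()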
