
AWS Glue Python shell now supports Python 3.9 with a flexible pre-loaded environment and support to install additional libraries


AWS Glue is the central service of an AWS modern data architecture. It is a serverless data integration service that allows you to discover, prepare, and combine data for analytics and machine learning. AWS Glue offers you a comprehensive range of tools to perform ETL (extract, transform, and load) at the right scale. AWS Glue Python shell jobs are designed for running small-to-medium size ETL and triggering SQLs (including long-running queries) on Amazon Redshift, Amazon Athena, Amazon EMR, and more.

Today, we're excited to announce a new release of AWS Glue Python shell that supports Python 3.9 with additional pre-loaded libraries. Furthermore, it lets you customize your Python shell environment with pre-loaded libraries and offers PIP support to install other native or custom Python libraries.

The new release of AWS Glue Python shell includes the necessary Python libraries to connect your script to SQL engines and data warehouses, such as SQLAlchemy, PyMySQL, pyodbc, psycopg2, redshift, and more. It also supports communication with other AWS services such as Amazon OpenSearch Service (opensearch-py, elasticsearch), Amazon Neptune (gremlinpython), or Athena (PyAthena). It integrates Amazon SageMaker Data Wrangler for ETL tasks like loading and unloading data from data lakes, data warehouses, and databases. It also includes library support for data serialization in industry formats such as avro and et-xmlfile.

In this post, we walk you through how to use AWS Glue Python shell to create an ETL job that imports an Excel file and writes it to a relational database and a data warehouse. The job reads the Excel file as a Pandas DataFrame, creates a data profiling report, and exports it into your Amazon Simple Storage Service (Amazon S3) bucket. This routine cleans inaccurate information and imputes missing values based on predefined business rules. It writes the data into a target MySQL database for low-latency data access. Additionally, in parallel, the script exports the DataFrame into the data lake in columnar format to be copied into Amazon Redshift for reporting and visualization.

AWS Glue Python shell new features

The new release of AWS Glue Python shell allows you to use new features of Python 3.9 and add custom libraries to your script using job parameter configurations. This gives you more flexibility to write your Python code and reduces the need to manually maintain and update the Python libraries needed for your code.

Customized pre-loaded library environments

AWS Glue Python shell for Python 3.9 comes with two library environment options:

  • analytics (default) – You can run your script in a fully pre-loaded environment for complex analytics workloads. This option loads the full bundle of libraries.
  • none – You can choose an empty environment for simple and fast ETL jobs. This option only loads awscli and botocore as basic libraries.

You can set this option by using the library-set parameter in the job creation, for example:

"library-set":"analytics"

For your reference, the following table lists the libraries included in each option (a dot means the library isn't part of that set).

Python version: 3.9

Library              analytics (default)   none
avro                 1.11.0                .
awscli               1.23.5                1.23.5
awswrangler          2.15.1                .
botocore             1.23.5                1.23.5
boto3                1.22.5                .
elasticsearch        8.2.0                 .
numpy                1.22.3                .
pandas               1.4.2                 .
psycopg2             2.9.3                 .
pyathena             2.5.3                 .
PyMySQL              1.0.2                 .
pyodbc               4.0.32                .
pyorc                0.6.0                 .
redshift-connector   2.0.907               .
requests             2.27.1                .
scikit-learn         1.0.2                 .
scipy                1.8.0                 .
SQLAlchemy           1.4.36                .
s3fs                 2022.3.0              .

Added support for library compilers

In this release, you can import and install libraries as part of the script, including your own C-based libraries. You have PIP support to install native or customer-provided Python libraries with the support of the following compilers:

  • gcc
  • gcc-c++
  • gmake
  • cmake
  • cython
  • boost-devel
  • conda
  • python-dev

If you want to include a new package during your job creation, you can add the job parameter --additional-python-modules followed by the name of the library and the version. For example:

"--additional-python-modules":"boto3=1.22.13"

How to use the new features with the AWS Glue Python shell script

Now that we've introduced the new features, let's create a Python 3.9 job with additional libraries with AWS Glue Python shell. You have two options to create and submit a job: you can use the interface of AWS Glue Studio, or the AWS Command Line Interface (AWS CLI) for a programmatic approach.

AWS Glue Studio

To use AWS Glue Studio, complete the following steps:

  1. On the AWS Glue Studio console, create a new job and select Python Shell script editor.
  2. Enter a job name and enter your Python script.
  3. On the Job details tab, enter an optional description.
  4. For IAM Role, choose your job role.
  5. For Python version, choose Python 3.9.
  6. Select Load common Python libraries.
  7. Choose the script and the temporary files locations.
  8. Include the additional libraries as job parameters (--additional-python-modules).

AWS CLI

With the new release, you can now use the AWS CLI with the new parameters. The following is an example of an AWS CLI statement to create an AWS Glue Python shell job with Python 3.9:

$ aws glue create-job \
    --name <job_name> \
    --role <glue_role> \
    --command '{
        "Name": "pythonshell",
        "PythonVersion": "3.9",
        "ScriptLocation": "s3://<path_to_your_python_script>.py"
    }' \
    --default-arguments '{
        "--TempDir": "s3://<path_to_your_temp_dir>",
        "--job-language": "python",
        "library-set": "<analytics/default/none>",
        "--additional-python-modules": "<python package>=<version>, <>=<>"
    }' \
    --connections Connections=<your_glue_connection> \
    --timeout 30 \
    --max-capacity 0.0625

Let's explore the main differences from the previous AWS Glue Python shell versions:

  • Set the option PythonVersion within the --command parameter to 3.9.
  • To add new libraries, use --additional-python-modules as a new parameter and then list the library and the required version as follows: boto3=1.22.13.
  • Include library-set within --default-arguments and choose one of the values, such as default/analytics/none.
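After the job is created, you can start and monitor a run from the same terminal, for example:

$ aws glue start-job-run --job-name <job_name>
$ aws glue get-job-run --job-name <job_name> --run-id <run_id>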

Solution overview

This tutorial demonstrates the new features using a common use case where data arrives in your system as spreadsheet file reports. In this case, you want to quickly orchestrate a way to serve this data to the right tools. This script imports the data from Amazon S3 into a Pandas DataFrame. It creates a profiling report that is exported into your S3 bucket as an HTML file. The routine cleans inaccurate information and imputes missing values based on predefined business rules. It writes the data directly from Python shell to an Amazon Relational Database Service (Amazon RDS) for MySQL server for low-latency app response. Additionally, it exports the data into a Parquet file and copies it into Amazon Redshift for visualization and reporting.

In our case, we treat each scenario as an independent task with no dependency between them. You only need to create the infrastructure for the use cases that you want to test. Each section provides guidance and links to the documentation to set up the necessary infrastructure.

Prerequisites

There are a few requirements that are common to all scenarios:

  1. Create an S3 bucket to store the input and output files, the script, and temporary files.
    Then, we create the AWS Identity and Access Management (IAM) user and role necessary to create and run the job.
  2. Create an IAM AWS Glue service role called glue-blog-role and attach the AWS managed policy AWSGlueServiceRole for general AWS Glue permissions. If you're also testing an Amazon Redshift or Amazon RDS use case, you need to grant the necessary permissions to this role. For more information, refer to Using identity-based policies (IAM policies) for Amazon Redshift and Identity-based policy examples for Amazon RDS.
  3. Create an IAM user with security credentials and configure your AWS CLI on your local terminal.
    This allows you to create and launch your scripts from your local terminal. It is recommended to create a profile associated with this configuration.
    $ aws configure --profile glue-python-shell
    AWS Access Key ID
    AWS Secret Access Key
    Default region name
    Default output format

    The dataset used in this example is an Excel file containing Amazon Video Review data with the following structure. In a later step, we place the Excel file in our S3 bucket to be processed by our ETL script.

  4. Finally, to work with the sample data, we need four Python modules that are made available in AWS Glue Python shell when the parameter library-set is set to analytics:
    1. boto3
    2. awswrangler
    3. PyMySQL
    4. Pandas

Note that Amazon customer reviews are not licensed for commercial use. You should replace this data with your own authorized data source when implementing your application.

Load the data

In this section, you start writing the script by loading the data used in all the scenarios.

  1. Import the libraries that we need:
    import sys
    import io
    import os
    import boto3
    import pandas as pd
    import awswrangler as wr
    import pymysql
    import datetime
    from io import BytesIO

  2. Read the Excel spreadsheet into a DataFrame:
    AWS_S3_BUCKET = <your_s3_bucket_uri>
    s3 = boto3.resource(
        service_name="s3",
        region_name="<your_s3_region>"
    )
    obj = s3.Bucket(AWS_S3_BUCKET).Object('amazon_reviews_us_Video.xlsx').get()
    df = pd.read_excel(io.BytesIO(obj['Body'].read()))
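    Alternatively, the awswrangler module (pre-loaded in the analytics set) can read the spreadsheet from Amazon S3 in a single call. This is a minimal sketch, assuming the file sits at the root of your bucket and an Excel engine such as openpyxl is available:

    #alternative: read the Excel file directly from Amazon S3 with awswrangler
    df = wr.s3.read_excel("s3://<your_s3_bucket>/amazon_reviews_us_Video.xlsx")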

Scenario 1: Data profiling and dataset cleaning

To assist with basic data profiling, we use the pandas-profiling module and generate a profile report from our Pandas DataFrame. pandas-profiling supports output files in JSON and HTML format. In this post, we generate an HTML output file and place it in an S3 bucket for quick data analysis.

To use this new library within the job, add the --additional-python-modules parameter from the job details page in AWS Glue Studio or during job creation from the AWS CLI. Remember to include this package in the imports of your script:

from pandas_profiling import ProfileReport
…
profile = ProfileReport(df)
s3.Object(AWS_S3_BUCKET,'output-profile/profile.html').put(Body=profile.to_html())
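Because pandas-profiling can also serialize the report as JSON (as mentioned earlier), you can store a machine-readable copy next to the HTML file. This is a minimal sketch, assuming the installed pandas-profiling version exposes the to_json() method:

#optional: store a machine-readable copy of the same report
s3.Object(AWS_S3_BUCKET,'output-profile/profile.json').put(Body=profile.to_json())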

A common problem that we often see when dealing with column data types is a mix of data types identified as an object in a Pandas DataFrame. Mixed data type columns are flagged by pandas-profiling as Unsupported type and stored in the profile report description. We can access that information and standardize the columns to our desired data types.

The following lines of code loop over every column in the DataFrame and check whether any of the columns are flagged as Unsupported by pandas-profiling. We then cast them to string:

for col in df.columns:
    if (profile.description_set['variables'][col]['type']) == 'Unsupported':
        df[col] = df[col].astype(str)

To further clean or process your data, you can access the variables provided by pandas-profiling. The following example prints out all columns with missing values:

for col in df.columns:
    if profile.description_set['variables'][col]['n_missing'] > 0:
        print(col, " is missing ", profile.description_set['variables'][col]['n_missing'], " data type ", profile.description_set['variables'][col]['type'])
        #missing data handling
        #....
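What the missing data handling looks like depends entirely on your business rules. The following sketch illustrates one possible set of rules for this dataset; the fill values are assumptions for demonstration, not part of the original script:

#illustrative imputation rules -- replace with your own business logic
df['review_headline'] = df['review_headline'].fillna('No headline') #text columns: use a placeholder string
df['helpful_votes'] = df['helpful_votes'].fillna(0)                 #vote counters: assume zero
df = df.dropna(subset=['review_date'])                              #rows without a date can't be partitioned later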

Scenario 2: Export data in columnar format and copy it to Amazon Redshift

In this scenario, we export our DataFrame into Parquet columnar format, store it in Amazon S3, and copy it to Amazon Redshift. We use Data Wrangler to connect our script to Amazon Redshift. This Python module is already included in the analytics environment. Before running this scenario, set up the necessary infrastructure: an Amazon Redshift cluster and an AWS Glue connection that points to it.

Now we can write raw data to Amazon S3 in Parquet format and to Amazon Redshift.

A common partition strategy is to divide rows by year, month, and day from your date column and apply multi-level partitioning. This approach allows fast and cost-effective retrieval of all rows assigned to a particular year, month, or date. Another strategy is to partition your data by a specific column directly. For example, using review_date as a partition gives you a single level of directories for every unique date and stores the corresponding data in it.

In this post, we prepare our data for the multi-level date partitioning strategy. We start by extracting year, month, and day from our date column:

df['day'] = pd.DatetimeIndex(df['review_date']).day.astype(str)
df['month'] = pd.DatetimeIndex(df['review_date']).month.astype(str)
df['year'] = pd.DatetimeIndex(df['review_date']).year.astype(str)

With our partition columns ready, we can use the awswrangler module to write to Amazon S3 in Parquet format:

wr.s3.to_parquet(
    df=df,
    path="s3://<your_output_s3_bucket>", #replace this value with the path to your bucket
    dataset=True,
    mode="overwrite",
    partition_cols=['year','month','day']
)
To query your partitioned data in Amazon S3, you can use Athena, our serverless interactive query service. For more information, refer to Partitioning data with Athena.
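If you register the Parquet location as a table in the AWS Glue Data Catalog, you can also query it from the same script through the Athena integration in awswrangler. The database and table names below are hypothetical:

#query the partitioned Parquet data through Athena (table registered in the Glue Data Catalog)
reviews_2015 = wr.athena.read_sql_query(
    sql="SELECT count(*) AS total_reviews FROM amazon_video_review WHERE year = '2015'",
    database="demo_blog" #hypothetical Glue database name
)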

Next, we write our DataFrame directly to Amazon Redshift internal storage by using Data Wrangler. Writing to Amazon Redshift internal storage is advised if you're going to use this data frequently for complex analytics, large SQL operations, or business intelligence (BI) reporting. In Amazon Redshift, it's advised to define the distribution style and sort key on the table to improve cluster performance. If you're not sure about the right value for those parameters, you can use the Amazon Redshift auto distribution style and sort key and follow Amazon Redshift Advisor recommendations. For more information on Amazon Redshift data distribution, refer to Working with data distribution styles.

#drop review columns and preserve other columns for analysis
df = df.drop(['review_body','review_headline'], axis=1)

#generate a dictionary with lengths to be used by awswrangler to create varchar columns
max_length_object_cols = {col: df.loc[:, col].astype(str).apply(len).max() for col in df.select_dtypes([object]).columns}

#connect to Redshift via the Glue connection
con = wr.redshift.connect("<your_glue_connection>")

#copy the DataFrame into a Redshift table
wr.redshift.copy(
    df=df,
    path=<temporary path for staging files>,
    con=con,
    table="<your_redshift_table_name>", #awswrangler will create the table if it doesn't exist
    schema="<your_redshift_schema>",
    mode="overwrite",
    iam_role=<your_iam_role_arn_with_permission_to_redshift>,
    varchar_lengths=max_length_object_cols,
    diststyle="AUTO",
)

#close the connection
con.close()
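As an optional sanity check (not part of the original walkthrough), you can count the rows that landed in the target table before moving on:

#optional: verify the copy by counting the rows in the target table
con = wr.redshift.connect("<your_glue_connection>")
row_count = wr.redshift.read_sql_query(
    sql="SELECT count(*) FROM <your_redshift_schema>.<your_redshift_table_name>",
    con=con
)
print(row_count)
con.close()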

Scenario 3: Data ingestion into Amazon RDS

In this scenario, we open a connection from AWS Glue Python shell and ingest the data directly into Amazon RDS for MySQL. The infrastructure you require for this scenario is an RDS for MySQL database in the same Region as the AWS Glue Python shell job. For more information, refer to Creating a MySQL DB instance and connecting to a database on a MySQL DB instance.

With the PyMySQL and boto3 modules, we can now connect to our RDS for MySQL database and write our DataFrame into a table.

Prepare the variables for the connection and generate a database authentication token for database login:

#RDS connection details
MYSQL_ENDPOINT = "<mysql_endpoint>"
PORT = "3306"
USER = "<mysql_username>"
REGION = "<region_for_rds_mysql>"
DBNAME = "<database_name>"
session = boto3.Session(profile_name="<your_aws_profile>")
client = session.client('rds')

#generate a database authentication token
token = client.generate_db_auth_token(DBHostname=MYSQL_ENDPOINT, Port=PORT, DBUsername=USER, Region=REGION)

#connect to the database
connection = pymysql.connect(host=MYSQL_ENDPOINT,
    user=USER,
    password=token,
    db=DBNAME,
    ssl_ca="global-bundle.pem")

#prepare columns and values for the SQL insert statement
columns = ",".join(df.columns)
values = ",".join(['%s' for i in range(len(df.columns))])

#SQL statement to insert into RDS
load_sql = f"INSERT INTO demo_blog.amazon_video_review({columns}) VALUES ({values})"

For more information about using an SSL connection with your RDS instance, refer to Using SSL/TLS to encrypt a connection to a DB instance.

Connect to your RDS for MySQL database and write the Pandas DataFrame into the table with the following code:

try:
    with connection.cursor() as cur:
        cur.executemany(load_sql, df.values.tolist())
    connection.commit()
finally:
    cur.close()
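As with the Redshift scenario, an optional quick count confirms the insert; the connection opened earlier is still available at this point:

#optional: verify the number of rows ingested into MySQL
with connection.cursor() as cur:
    cur.execute("SELECT count(*) FROM demo_blog.amazon_video_review")
    print(cur.fetchone())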

You need to create a table in Amazon RDS for MySQL prior to running the insert statement. Use the following DDL to create the demo_blog.amazon_video_review table:

CREATE TABLE `amazon_video_review` (
  `marketplace` varchar(100) NOT NULL,
  `customer_id` bigint NOT NULL,
  `review_id` varchar(100) DEFAULT NULL,
  `product_id` varchar(100) DEFAULT NULL,
  `product_parent` bigint NOT NULL,
  `product_title` varchar(100) DEFAULT NULL,
  `product_category` varchar(100) DEFAULT NULL,
  `star_rating` bigint NOT NULL,
  `helpful_votes` bigint NOT NULL,
  `total_votes` bigint NOT NULL,
  `vine` varchar(100) DEFAULT NULL,
  `verified_purchase` varchar(100) DEFAULT NULL,
  `review_headline` varchar(100) DEFAULT NULL,
  `review_body` varchar(5000) DEFAULT NULL,
  `review_date` date NOT NULL,
  `year` varchar(100) DEFAULT NULL,
  `month` varchar(100) DEFAULT NULL,
  `date` varchar(100) DEFAULT NULL,
  `day` varchar(100) DEFAULT NULL
)

When the data is available in the database, you can perform a simple aggregation as follows:

agg_sql = "insert into demo_blog.video_review_recap select product_title, year as review_year, count(*) as total_review, sum(case when verified_purchase='Y' then 1 else 0 end) as total_verified_purchase, sum(case when verified_purchase='N' then 1 else 0 end) as total_unverified_purchase from demo_blog.amazon_video_review avr group by 1 order by 2 DESC"
cursor = connection.cursor()
cursor.execute(agg_sql)
connection.commit()
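The aggregation writes into a demo_blog.video_review_recap table that is assumed to already exist. If you need to create it, the following DDL is one possible definition matching the columns selected above (the column types are assumptions):

CREATE TABLE `video_review_recap` (
  `product_title` varchar(100) DEFAULT NULL,
  `review_year` varchar(100) DEFAULT NULL,
  `total_review` bigint DEFAULT NULL,
  `total_verified_purchase` bigint DEFAULT NULL,
  `total_unverified_purchase` bigint DEFAULT NULL
)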

Create and run your job

After you finalize your code, you can run it from AWS Glue Studio or save it in a script .py file and submit a job with the AWS CLI. Remember to add the necessary parameters in your job creation depending on the scenario you're testing. The following job parameters cover all the scenarios:

--command PythonVersion=3.9 …
--default-arguments '{"library-set":"analytics", "--additional-python-modules":"pandas_profiling", …}'

Review the results

In this section, we review the expected results for each scenario.

In Scenario 1, pandas-profiling generates a data report in HTML format. In this report, you can visualize missing values, duplicated values, size estimations, or correlations between columns.


For Scenario 2, you can first review the Parquet files written to Amazon S3 with the year/month/day partitions.


Then you can use the Amazon Redshift query editor to query and visualize the data.


For Scenario 3, you can use a JDBC connection or a database IDE to connect to your RDS database and query the data that you just ingested.


Clean up

AWS Glue Python shell is a serverless routine that won't incur any extra charges when it isn't running. However, this demo used several services that will incur additional costs. Clean up after completing this walkthrough with the following steps:

  1. Remove the contents of your S3 bucket and delete it. If you encounter any errors, refer to Why can't I delete my S3 bucket using the Amazon S3 console or AWS CLI, even with full or root permissions.
  2. Stop and delete the RDS DB instance. For instructions, see Deleting a DB instance.
  3. Stop and delete the Amazon Redshift cluster. For instructions, refer to Deleting a cluster.

Conclusion

In this post, we introduced AWS Glue Python shell with Python 3.9 support and more pre-loaded libraries. We presented the customizable Python shell environment with pre-loaded libraries and PIP support to install other native or custom Python libraries. We covered the new features and how to get started through AWS Glue Studio and the AWS CLI. We also demonstrated a step-by-step tutorial of how you can easily use these new capabilities to accomplish common ETL use cases.

To learn more about AWS Glue Python shell and this new feature, refer to Python shell jobs in AWS Glue.


About the authors

Alunnata Mulyadi is an Analytics Specialist Solutions Architect at AWS. Alun has over a decade of experience in data engineering, helping customers address their business and technical needs. Outside of work, he enjoys photography, cycling, and basketball.

Quim Bellmunt is an Analytics Specialist Solutions Architect at Amazon Web Services. Quim has a PhD in Computer Science and Knowledge Graph focusing on data modeling and transformation. With over 6 years of hands-on experience in the analytics and AI/ML space, he enjoys helping customers create systems that scale with their business needs and generate value from their data. Outside of work, he enjoys walking with his dog and cycling.

Kush Rustagi is a Software Development Engineer on the AWS Glue team with over 4 years of experience in the industry, having worked on large-scale financial systems in Python and C++, and is now applying his scalable system design experience to cloud development. Before working on Glue Python Shell, Kush worked on anomaly detection challenges in the fin-tech space. Aside from exploring new technologies, he enjoys EDM, traveling, and learning non-programming languages.
