Cross-account AWS Glue Data Catalog access with Glue ETL

Uncategorized
| May 16, 2020

To process data in AWS Glue ETL, a DataFrame or DynamicFrame is required. A DataFrame is similar to a table and supports functional-style operations (map/reduce/filter/etc.) along with SQL operations. The AWS Glue DynamicFrame is similar to a DataFrame, except that each record is self-describing, so no schema is required initially. It computes a schema on-the-fly when required, and explicitly encodes schema inconsistencies using a choice (or union) type.

A DynamicFrame can be created using the options below –
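For illustration, two common ways to create one (a minimal sketch; the database, table name, bucket, and path below are placeholders, not from the original post):

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Option 1: from a table already defined in the Glue Data Catalog
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="mydb",
    table_name="mytable"
)

# Option 2: directly from S3, letting Glue compute the schema on the fly
dyf2 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/prefix/"]},
    format="parquet"
)

df = dyf.toDF()   # convert to a Spark DataFrame when SQL-style operations are needed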

Implementing Glue ETL job with Job Bookmarks

Uncategorized
| May 7, 2020

AWS Glue is a fully managed ETL service to load large datasets from various sources for analytics and data processing with Apache Spark ETL jobs.

In this post I will discuss the use of the AWS Glue Job Bookmarks feature in the following architecture.

[Architecture diagram]

AWS Glue Job Bookmarks help Glue maintain state information for the ETL job and process only new data when the job reruns on a scheduled interval, preventing reprocessing of old data. (more...)
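In practice, bookmarks rely on a transformation_ctx on each source and a job.commit() at the end of the script; a minimal sketch (database and table names are placeholders) looks like this:

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# The transformation_ctx is what Glue uses to track bookmark state for this source
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="mydb",
    table_name="mytable",
    transformation_ctx="source_dyf"
)

# ... transforms and writes ...

job.commit()   # persists the bookmark so the next run only picks up new data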

Aurora MySQL – Export data to S3

Uncategorized
| Apr 16, 2020

Using SELECT INTO OUTFILE S3 you can query data from an Aurora MySQL DB cluster and save it directly into text files stored in an S3 bucket.
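The query itself looks roughly like this (a sketch; the table, bucket, and prefix are placeholders, and the cluster needs an IAM role allowing S3 writes, which the steps below set up):

SELECT *
FROM mydb.orders
INTO OUTFILE S3 's3-us-west-2://bucket-name/exports/orders'
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n';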

1. Create an IAM policy for S3.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "s3:DeleteObject",
        "s3:GetBucketLocation",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads",
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::bucket-name",
        "arn:aws:s3:::bucket-name/*"
      ]
    },
    {
      "Sid": "VisualEditor1",
      "Effect": "Allow",
      "Action": [
        "s3:ListAllMyBuckets",
        "s3:HeadBucket"
      ],
      "Resource": "*"
    }
  ]
}

2. Create (more...)

Reading Parquet files with AWS Lambda

Uncategorized
| Apr 14, 2020

I had a use case to read data (a few columns) from Parquet files stored in S3 and write it to a DynamoDB table every time a file was uploaded. Planning to use AWS Lambda, I was looking at options for reading Parquet files within Lambda until I stumbled upon AWS Data Wrangler.

From the document –

What is AWS Data Wrangler?

An open-source Python package that extends the power of Pandas library to AWS connecting (more...)
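With it, reading just the needed columns from Lambda reduces to a few lines; a minimal sketch (bucket path and column names are placeholders, assuming an awswrangler release that exposes the s3 module):

import awswrangler as wr

# Read only the columns needed from the newly uploaded Parquet file
df = wr.s3.read_parquet(
    path="s3://demo-bucket/dw/user/dt=2020-04-14/",
    columns=["user_id", "score"]
)
print(df.head())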

Scan Dynamodb table from AWS Glue in different account

Uncategorized
| Mar 28, 2020

In this blog post I will list the steps required to set up an AWS Glue job to scan a DynamoDB table in another account. In my setup, I scan the DynamoDB table in Account A (us-west-2), perform Glue transformations in Account B (us-east-1), and write the output to S3 in Account B.

Account A – DynamoDB table
Account B – AWS Glue job, S3 bucket

1. Create an IAM role in Account B (us-east-1) with (more...)
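Once the roles and permissions are in place, the scan from the Glue job can be expressed roughly as below (a sketch; the table name, region, and role ARN are placeholders, and dynamodb.sts.roleArn assumes the connector is allowed to assume a role in Account A):

dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.input.tableName": "table-in-account-a",
        "dynamodb.region": "us-west-2",
        "dynamodb.sts.roleArn": "arn:aws:iam::ACCOUNT_A_ID:role/glue-dynamodb-read-role"
    }
)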

AWS Glue – Querying Nested JSON with Relationalize Transform

Uncategorized
| Feb 26, 2020

AWS Glue has a Relationalize transform that can flatten nested JSON into columns that you can then write to S3 or import into relational databases. As an example –

Initial Schema:

>>> df.printSchema()
root
|-- Id: string (nullable = true)
|-- LastUpdated: long (nullable = true)
|-- LastUpdatedBy: string (nullable = true)
|-- Properties: struct (nullable = true)
|    |-- choices: string (nullable = true)
|    |-- database: string (nullable = true)
|    |-- object:  (more...)
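Applying the transform in PySpark typically looks like this (a sketch; the staging path and frame names are placeholders):

from awsglue.transforms import Relationalize

dfc = Relationalize.apply(
    frame=dyf,
    staging_path="s3://my-bucket/glue-temp/",
    name="root",
    transformation_ctx="relationalize"
)
root_dyf = dfc.select("root")   # the top-level flattened table; nested collections become separate frames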

Redshift: Convert TEXT to Timestamp

Uncategorized
| Feb 18, 2020

How do you convert TEXT to a timestamp in Redshift?

If the score column has data in the format given below, how can you display the timestamp?

{"Choices":null, "timestamp":"1579650266955", "scaledScore":null}
select cast(json_extract_path_text(score, 'timestamp') as timestamp) from schema.table limit 10;

This SQL will fail with —

ERROR:  Invalid data
DETAIL:
  -----------------------------------------------
  error:  Invalid data
  code:      8001
  context:   Invalid format or data given: 1579650266955
  query:     2057693
  location:  funcs_timestamp.cpp:261
  process:   query1_120_2057693 [pid=6659]
  -----------------------------------------------

In order to extract the timestamp (more...)
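The value is an epoch in milliseconds, so it cannot be cast to a timestamp directly. One common pattern (a sketch, not necessarily the query the post ends with) is to add it to the epoch as an interval:

select timestamp 'epoch'
       + (cast(json_extract_path_text(score, 'timestamp') as bigint) / 1000) * interval '1 second'
from schema.table
limit 10;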

Usecase with RDS Snapshot Export to S3

Uncategorized
| Feb 16, 2020

AWS recently announced “Amazon RDS Snapshot Export to S3” feature wherein you can now export Amazon Relational Database Service (Amazon RDS) or Amazon Aurora snapshots to Amazon S3 as Apache Parquet, an efficient open columnar storage format for analytics.

I had a use case to refresh Athena tables daily with the full data set in Account B (us-east-1) from an Aurora MySQL database running in a private subnet in Account A (us-west-2). The two solutions I could think (more...)

Athena: Extracting data from JSON

Uncategorized
| Feb 15, 2020

Suppose you have a table in Athena and one of its columns contains JSON data. How can you extract the individual keys? In the example, the table has a column "fixedproperties" which contains JSON data –

[Query output showing the fixedproperties column with JSON data]

How can you display the data in the below format?

[Desired output with object_id, cust_id, and score as separate columns]

select
    json_extract(fixedproperties, '$.objectId') as object_id,
    json_extract(fixedproperties, '$.custId') as cust_id,
    json_extract(fixedproperties, '$.score') as score
from json_demo;

For more examples on extracting data from JSON refer to https://docs.aws.amazon.com/athena/latest/ug/extracting-data-from-JSON.html

Athena – SQL to get date of next Monday

Uncategorized
| Nov 18, 2019

I was recently asked how to get the date of next Monday irrespective of which day of the week the SQL is executed. So I thought to share it, in case someone else has a similar requirement.

select date_add('day', 8 - extract(day_of_week from current_date), current_date)

 
Or,

select date_trunc('week', current_date) + interval '7' day;

Happy learning 🙂

 

Pandas Scratchpad – I

Uncategorized
| Oct 6, 2019

This blog post is a scratchpad for day-to-day Pandas commands.

pandas is an open-source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

1. A few quick ways to create a Pandas DataFrame (a condensed sketch of each appears below) –

DataFrame from a dict of lists
DataFrame from a list of lists
DataFrame from a list of dicts
DataFrame using the zip function

data = {'Name':['Iron Man', 'Deadpool', 'Captain America', 'Thor', 'Hulk', 'Spider Man'], 'Age':[48, 30, 100, 150, 50,  (more...)
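A condensed sketch of the four approaches (sample data abbreviated):

import pandas as pd

names = ['Iron Man', 'Deadpool', 'Hulk']
ages = [48, 30, 50]

# 1. From a dict of lists
df1 = pd.DataFrame({'Name': names, 'Age': ages})

# 2. From a list of lists
df2 = pd.DataFrame([['Iron Man', 48], ['Deadpool', 30]], columns=['Name', 'Age'])

# 3. From a list of dicts
df3 = pd.DataFrame([{'Name': 'Iron Man', 'Age': 48}, {'Name': 'Hulk', 'Age': 50}])

# 4. Using zip
df4 = pd.DataFrame(list(zip(names, ages)), columns=['Name', 'Age'])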

Python – Counter – Compare Lists

Uncategorized
| Aug 30, 2019

A few days back I wrote about using sorted(list) to compare two lists.

Recently I learned we can also use Counter to compare lists without taking their order into account.

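A quick sketch of the idea:

from collections import Counter

a = [1, 2, 3, 2]
b = [2, 3, 2, 1]
c = [1, 2, 3]

print(Counter(a) == Counter(b))   # True  - same elements, order ignored
print(Counter(a) == Counter(c))   # False - the counts differ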

Happy Learning !!

Merge json files using Pandas

Uncategorized
| Aug 25, 2019

A quick demo of merging multiple JSON files using Pandas –

import pandas as pd
import glob
import json

file_list = glob.glob("*.json")
>>> file_list
['b.json', 'c.json', 'a.json']

Use enumerate to assign a counter to each file.


allFilesDict = {v:k for v, k in enumerate(file_list, 1)}
>>> allFilesDict
{1: 'b.json', 2: 'c.json', 3: 'a.json'}

Append the data into a list –

>>> data = []

for k,v in allFilesDict.items():
    if 1  (more...)
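The rest of the loop is cut off above; a common way to finish the merge (an assumed sketch, not necessarily the original code) is to load each file, collect the records, and write a single combined file:

data = []
for k, v in allFilesDict.items():
    with open(v) as f:
        data.append(json.load(f))

merged_df = pd.DataFrame(data)
merged_df.to_json("merged.json", orient="records")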

Pandas – ValueError: If using all scalar values, you must pass an index

Uncategorized
| Aug 13, 2019

Reading a JSON file using Pandas read_json can fail with "ValueError: If using all scalar values, you must pass an index". Let's see with an example –

cat a.json
{
  "creator": "CaptainAmerica",
  "last_modifier": "NickFury",
  "title": "Captain America: The First Avenger",
  "view_count": 12000
}
>>> import pandas as pd
>>> import glob
>>> for f in glob.glob('*.json'):
...     print(f)
...
b.json
c.json
a.json
>>> pd.read_json('a.json')
Traceback (most recent call last):
  File  (more...)
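The error occurs because every value in the JSON object is a scalar, so Pandas cannot infer an index. Two common workarounds (a sketch, not necessarily the post's resolution):

import json
import pandas as pd

# Workaround 1: read the object as a Series
s = pd.read_json('a.json', typ='series')

# Workaround 2: normalize it into a one-row DataFrame (pd.json_normalize assumes a recent pandas)
with open('a.json') as f:
    df = pd.json_normalize(json.load(f))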

S3 – fatal error: An error occurred (404) when calling the HeadObject operation

Uncategorized
| Aug 5, 2019

Make sure to use the --recursive parameter.

[desktop: test]:${PWD}> aws s3 cp s3://demo-beta/dw/user/dt=2019-07-30/ /tmp/dw/
fatal error: An error occurred (404) when calling the HeadObject operation: Key "dw/user/dt=2019-07-30/" does not exist

[desktop: test]:${PWD}> aws s3 cp s3://demo-beta/dw/user/dt=2019-07-30/ /tmp/dw/ --recursive
download: s3://demo-beta/dw/user/dt=2019-07-30/part-00002-fd866c-238-489-a44-739f1d04-c000.snappy.parquet to ../../../tmp/dw/part-00002-fd866c-238-489-a44-739f1d04-c000.snappy.parquet

From Documentation –

--recursive (boolean) Command is performed on all files or objects under the specified directory or prefix.