Merge json files using Pandas

Quick demo for merging multiple JSON files using pandas –

import pandas as pd
import glob
import json

file_list = glob.glob("*.json")
>>> file_list
['b.json', 'c.json', 'a.json']

Use enumerate to assign a counter to each file –

allFilesDict = {k: v for k, v in enumerate(file_list, 1)}
>>> allFilesDict
{1: 'b.json', 2: 'c.json', 3: 'a.json'}

Append the data to a list –

>>> data = []

for k,v in allFilesDict.items():
    if 1  (more...)
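The post is truncated above, but a minimal self-contained sketch of the full flow might look like the following. The sample file names and contents are assumptions standing in for a.json, b.json, and c.json; the idea is simply to load each file's dict into a list and build one DataFrame from it:

```python
import glob
import json
import os
import tempfile

import pandas as pd

# Create three sample JSON files standing in for a.json, b.json, c.json
tmpdir = tempfile.mkdtemp()
samples = {
    "a.json": {"id": 1, "name": "alpha"},
    "b.json": {"id": 2, "name": "beta"},
    "c.json": {"id": 3, "name": "gamma"},
}
for fname, payload in samples.items():
    with open(os.path.join(tmpdir, fname), "w") as f:
        json.dump(payload, f)

file_list = sorted(glob.glob(os.path.join(tmpdir, "*.json")))
allFilesDict = {k: v for k, v in enumerate(file_list, 1)}

# Load each file's dict into a list, then merge into one DataFrame
data = []
for k, v in allFilesDict.items():
    with open(v) as f:
        data.append(json.load(f))

df = pd.DataFrame(data)
print(df)
```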

Pandas – ValueError: If using all scalar values, you must pass an index

Reading a JSON file using pandas read_json can fail with “ValueError: If using all scalar values, you must pass an index”. Let's see with an example –

cat a.json
{
  "creator": "CaptainAmerica",
  "last_modifier": "NickFury",
  "title": "Captain America: The First Avenger",
  "view_count": 12000
}
>>> import pandas as pd
>>> import glob
>>> for f in glob.glob('*.json'):
...     print(f)
...
b.json
c.json
a.json
>>> pd.read_json('a.json')
Traceback (most recent call last):
  File  (more...)
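The error comes from a.json holding a single flat object of scalar values – pandas cannot infer an index for a one-row DataFrame. Two common workarounds, sketched below (the file is recreated locally so the snippet is self-contained): read it as a Series with `typ="series"`, or load the dict yourself and wrap it in a one-element list so pandas sees one row.

```python
import json
import os
import tempfile

import pandas as pd

# Recreate a.json with the scalar-only payload from above
payload = {
    "creator": "CaptainAmerica",
    "last_modifier": "NickFury",
    "title": "Captain America: The First Avenger",
    "view_count": 12000,
}
path = os.path.join(tempfile.mkdtemp(), "a.json")
with open(path, "w") as f:
    json.dump(payload, f)

# Option 1: read the object as a Series instead of a DataFrame
s = pd.read_json(path, typ="series")

# Option 2: load the dict yourself and wrap it in a list -> one-row DataFrame
with open(path) as f:
    df = pd.DataFrame([json.load(f)])
print(df)
```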

S3 – fatal error: An error occurred (404) when calling the HeadObject operation

Make sure to use the --recursive parameter.

[desktop: test]:${PWD}> aws s3 cp s3://demo-beta/dw/user/dt=2019-07-30/ /tmp/dw/
fatal error: An error occurred (404) when calling the HeadObject operation: Key "dw/user/dt=2019-07-30/" does not exist

[desktop: test]:${PWD}> aws s3 cp s3://demo-beta/dw/user/dt=2019-07-30/ /tmp/dw/ --recursive
download: s3://demo-beta/dw/user/dt=2019-07-30/part-00002-fd866c-238-489-a44-739f1d04-c000.snappy.parquet to ../../../tmp/dw/part-00002-fd866c-238-489-a44-739f1d04-c000.snappy.parquet

From Documentation –

--recursive (boolean) Command is performed on all files or objects under the specified directory or prefix.

Python – sort() vs sorted(list)

You can compare lists using sort() or sorted(list), but be careful with sort() – it sorts in place and returns None, so comparing the return values of two sort() calls is always True –

>>> c = [('d',4), ('c',3), ('a',1), ('b', 2)]
>>> a = [('a',1), ('b', 2), ('c',3), ('d',4)]
>>> a.sort() == c.sort()
True
>>>
>>> a = [('a',1), ('b', 2), ('c',3), ('d',4)]
>>> b = [('b',2), ('c', 3), ('a',1)]
>>>
>>> a.sort() == b.sort()
True

>>> a = [('a',1), ('b', 2), ('c',3), ('d',4)]
>>> b = [('b',2), ('c', 3),  (more...)

Python – str.maketrans()

Working on some Python code, I had a requirement to remove the single/double quotes and open/close brackets from a string of the format below –

>>> text = """with summary as (select '
...  'p.col1,p.col2,p.col3, ROW_NUMBER() '
...  'OVER(PARTITION BY p.col1,p.col3 ORDER BY '
...  'p.col2) AS rk from (select * from (select '
...  'col2, col1, col3, '
...  'sum(col4) as col6 from '
...  '"demo"."tab1" a join '
...  "(select lpad(col5, 12, '0') as  (more...)
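The post is truncated above, but the general pattern with str.maketrans() is to map each unwanted character to None via the three-argument form, then apply translate(). A minimal sketch on a shortened version of the string (the sample text here is an abbreviated stand-in for the full query above):

```python
# Abbreviated stand-in for the SQL-fragment string in the post
text = """with summary as (select 'p.col1, p.col2' from "demo"."tab1")"""

# The third argument to str.maketrans() lists characters to delete:
# single quote, double quote, open paren, close paren
drop = str.maketrans('', '', '\'"()')
cleaned = text.translate(drop)
print(cleaned)  # → with summary as select p.col1, p.col2 from demo.tab1
```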

Expanding array to multiple rows – Athena

A single row in an Athena table is stored as –

select id, course, date from demo.course_tab where id='1234567892'

id          course                                                                                 date
1234567892  [95c3c1bc5873, 2e345b2eb678, 027b02599f4a, 8695a580520b, 5d453355d415, cdcc7682070b]   2019-06-13

The datatype of the course column is array(string). Now, how can you get the output in the format below –

 	id	        course          date

1	1234567892	95c3c1bc5873	2019-06-13
2	1234567892	2e345b2eb678	2019-06-13
3	1234567892	027b02599f4a	2019-06-13
4	1234567892	8695a580520b	2019-06-13
5	1234567892	5d453355d415	2019-06-13
6	1234567892	cdcc7682070b	2019-06-13

(more...)
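The post is truncated, but the usual Athena (Presto) approach to expanding an array column into rows is CROSS JOIN UNNEST. A sketch against the table above (the alias `course_id` is a name chosen here for illustration):

```sql
SELECT id, course_id, date
FROM demo.course_tab
CROSS JOIN UNNEST(course) AS t(course_id)
WHERE id = '1234567892'
```

Each element of the `course` array becomes its own row, with `id` and `date` repeated alongside it.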

S3 bucket size using aws cli

To get the total object count and total size for a folder within an S3 bucket recursively –


aws s3 ls --summarize --human-readable --recursive s3://content-dev/mywork/dumb | grep 'Total'

Total Objects: 1922
   Total Size: 297.4 GiB

named tuple to JSON – Python

In pgdb – the PostgreSQL DB API – the cursor used to manage the context of a fetch operation returns a list of named tuples. These named tuples have field names matching the column names of the database query.

An example of a row from the list of named tuples –


Row(log_time=datetime.datetime(2019, 3, 20, 5, 41, 29, 888000, tzinfo=), user_name='admin', connection_from='72.20.208.64:21132', command_tag='INSERT', message='AUDIT: SESSION,1,1,WRITE,INSERT,TABLE,user.demodml,"insert into user.demodml (id) values (1),(2),(3),(4),(5),(6),(7),(8),(9),(11);",',  (more...)
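A named tuple is not directly JSON-serializable, but its `_asdict()` method converts it to a dict that `json.dumps` can handle. A minimal sketch with a hypothetical `Row` modeled on a few of the fields above (real rows with datetime fields would additionally need `default=str` or a custom encoder):

```python
import json
from collections import namedtuple

# Hypothetical row shape, modeled on the pgdb output above
Row = namedtuple('Row', ['user_name', 'connection_from', 'command_tag'])
row = Row(user_name='admin',
          connection_from='72.20.208.64:21132',
          command_tag='INSERT')

# _asdict() turns the named tuple into a dict, which json can serialize
payload = json.dumps(row._asdict())
print(payload)
```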

PostgreSQL – LWLock:multixact_offset

On one of the PostgreSQL databases, the active sessions were waiting on IO:SLRURead and LWLock:multixact_offset, causing application latency.


As per the PostgreSQL documentation, SLRURead is waiting for a read of an SLRU page, and multixact_offset is waiting for I/O on a multixact offset buffer.

Diagnosing the issue took some time, so cutting to the chase –

1. Identified the SQLs and the relation (and its size) for which these wait events were high. These tables had ~1500 DMLs (more...)

Python List

This blog post is about appending data elements to a list in Python.

Suppose we have a simple list “x”; we will look at different ways to append elements to it.

x = [1, 2, 3]

The “append” method appends only a single element –

>>> x
[1, 2, 3]
>>> x.append(4)
>>> x
[1, 2, 3, 4]
>>>

>>> x.append(5, 6, 7)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError:  (more...)
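The traceback is truncated, but the cause is that append() takes exactly one argument. To add several elements at once, use extend(); and note that append() with a list argument nests the whole list as a single element:

```python
x = [1, 2, 3, 4]

# append() takes exactly one argument; to add several elements use extend()
x.extend([5, 6, 7])
print(x)  # → [1, 2, 3, 4, 5, 6, 7]

# append() with a list adds the list itself as one nested element
y = [1, 2, 3]
y.append([4, 5])
print(y)  # → [1, 2, 3, [4, 5]]
```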

Happy New Year 2019

Wish you all a very Happy, Healthy and Successful New Year 2019. Have a blessed year ahead!

Cheers!!

PostgreSQL – CPU Utilization and Index

One of the production Aurora PostgreSQL instances, running on a db.r4.16xlarge instance (64 vCPUs and 488 GB memory), was reporting high CPU utilization spiking up to 100%.


With such issues, one of the first things to do is look for the SQLs with high shared buffer hits. I have built a small tool called pgsnap, which is similar to the AWR repository in Oracle, maintaining SQL stat history. So, with pg_stat_statements and hist_pg_stat_statements (that’s what (more...)

Python – Flatten List of Lists

Itertools is one of the most powerful modules in Python. Today I had a requirement to flatten a list of lists, and itertools made it easy.

My list —

>>> val = [['a','b'],'c',['d','e','f']]

Required Result

['a', 'b', 'c', 'd', 'e', 'f']

How do you do it? Itertools to the rescue —

>>> from itertools import chain
>>> list(chain.from_iterable(val))
['a', 'b', 'c', 'd', 'e', 'f']

So simple !!
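One caveat worth noting: this works for the bare 'c' only because it is a single character. Strings are themselves iterable, so chain.from_iterable splits a multi-character string into its characters:

```python
from itertools import chain

val = [['a', 'b'], 'c', ['d', 'e', 'f']]
print(list(chain.from_iterable(val)))  # → ['a', 'b', 'c', 'd', 'e', 'f']

# A multi-character string element gets split into characters:
val2 = [['a', 'b'], 'cd']
print(list(chain.from_iterable(val2)))  # → ['a', 'b', 'c', 'd']
```

If the mixed list should keep 'cd' intact, the string elements need to be wrapped in a list first.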

Python — Enumerate

Suppose you have a dataset (data) and want to find every 5th item.  How would you do it?

data = [ 1, 3, 5, 7, 9, 11, 13, 15, 17, 19]

The first thing that comes to mind is using a slice, but that won’t work since it’s based on the index, and the index starts from 0.

>>> data[::5]
[1, 11]

The answer should be = [9, 19]

This is where enumerate comes in play. (more...)
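The post is truncated, but a sketch of the enumerate-based approach: starting the counter at 1 makes "every 5th item" a simple modulo test –

```python
data = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]

# enumerate with start=1 pairs a 1-based counter with each value
every_fifth = [v for i, v in enumerate(data, 1) if i % 5 == 0]
print(every_fifth)  # → [9, 19]
```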

AWS DMS – Target TableName Differs

AWS DMS is a tool that supports both homogeneous and heterogeneous migrations, helping you migrate to the AWS Cloud.

During most migrations, the source and target table names remain the same, in which case the Mappings.json file is pretty simple. As an example (Oracle to PostgreSQL) –

 {
    "rules":
    [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "1",
            "object-locator":
            {
                "schema-name": "DEVO",
                "table-name": "TEST_DEMO"
            },
            "rule-action": "include"
        },
        {
            "rule-type": "transformation",
            "rule-id": "2",
            "rule-name": "2",
             (more...)
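The transformation rule is cut off above, but when the target table name differs, it typically takes the shape below: a `rename` action on the table identified by the object-locator. The target name `test_demo_new` is a hypothetical example –

```json
{
    "rule-type": "transformation",
    "rule-id": "2",
    "rule-name": "2",
    "rule-target": "table",
    "object-locator": {
        "schema-name": "DEVO",
        "table-name": "TEST_DEMO"
    },
    "rule-action": "rename",
    "value": "test_demo_new"
}
```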