Can you modify data stored on S3?


One question that’s often asked is — “How can I modify or delete data that is on S3 or IBM Cloud Object Storage?” The answer is surprisingly simple. You can do that with the following caveats:
- It works only with Hive Transactional tables
- It is supported only for Hive ORC format
- It is supported only from Hive. That is — it is not supported from Spark. So you cannot use Spark SQL to create or work with these tables. Yet.
- It is supported only for Managed tables

Hive Transactional Tables

Apache Hive supports transactions — which means you can do CRUD operations. You can denote a table as transactional by using this property TBLPROPERTIES('transactional'='true')

ORC Format

ORC or Optimized Row Columnar format is a highly performant data format. It is the most popular after parquet format. You can denote a table to use the ORC format by specifying like this :STORED AS ORC

Complete Working Example

-- Create Table
CREATE TABLE employees_s3a (empid int, empname string, empband int) STORED AS ORC
LOCATION 's3a://matrix/employeestxn'
TBLPROPERTIES ('transactional' = 'true');
-- Insert data into table
INSERT INTO employees_s3a values (898989,'AAAAA',8);
INSERT INTO employees_s3a values (767676,'BBBBB',7);
INSERT INTO employees_s3a values (545454,'CCCCC',6);
-- Select data from the table
SELECT * from employees_s3a;
| 898989 | AAAAA | 8
| 767676 | BBBBB | 7
| 545454 | CCCCC | 6
3 rows selected (7.382 seconds)

-- DELETE data from table
DELETE FROM employees_s3a WHERE empname='BBBBB';
UPDATE employees_s3a SET empband=10 where empid=898989;
|employees_s3a.empid| employees_s3a.empname|employees_s3a.empband
| 545454 | CCCCC | 6
| 898989 | AAAAA | 10
2 rows selected (9.065 seconds)

Wait! Hold on! Don’t we need an EXTERNAL qualifier for data on S3?

No! That’s the last caveat. This works only with MANAGED hive tables. Which means you cannot use EXTERNAL tables. This is often confusing to users. It is not necessary to use EXTERNAL tables for data on S3 or COS. You just need to specify that through the LOCATION keyword as shown above.

The difference between Managed tables and External tables is that for the former, Hive takes care of managing the underlying data along with the metadata. Whereas for the latter, Hive only takes care of the metadata and does not disturb the underlying data. For example, if you DROP a managed table, it will delete the data as well. Whereas if you DROP an external table, only the schema/metadata will be removed from Hive. The data still remains on the underlying system — be it in HDFS or S3.

Try it out!

This is a simple concept — but the question has been asked far too many times to warrant this short blog.




Senior Consultant, IBM Cloud. Sharing titbits of epiphanies...

Love podcasts or audiobooks? Learn on the go with our new app.

Recommended from Medium

How to Perform Systemic Threat Modelling — An Example

Introducing zilWatchBot

Serverless — AWS Lambda Layers and Python Dependencies

“Scrum doesn’t have Project Managers, so we ignore them in our organisation”

Being a Computer Technician

Chapter 10 Zen of Programming Concurrency

Learn Python -20 Basics Concepts

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Mrudula Madiraju

Mrudula Madiraju

Senior Consultant, IBM Cloud. Sharing titbits of epiphanies...

More from Medium

Part 1 [Reliability Engineering]: Every system will fail eventually!

Data Dynamics’ StorageX and Azure Offer a Zero-Cost Migration Path to Cloud Intelligent, Efficient…

WWH of Traditional File Formats to Modern File formats with modern file Storages — DELL/EMC ECS…

Apache airflow dynamic parallel task creation