Can you modify data stored on S3?

Yes, you can — by using transactional Hive ORC tables

Mrudula Madiraju
3 min read · Jul 16, 2020

Answer

One question that’s often asked is — “How can I modify or delete data that is on S3 or IBM Cloud Object Storage?” The answer is surprisingly simple. You can do that with the following caveats:
- It works only with Hive Transactional tables
- It is supported only for Hive ORC format
- It is supported only from Hive. That is — it is not supported from Spark. So you cannot use Spark SQL to create or work with these tables. Yet.
- It is supported only for Managed tables

Hive Transactional Tables

Apache Hive supports ACID transactions — which means you can do CRUD operations on a table. You denote a table as transactional by setting the table property TBLPROPERTIES ('transactional'='true')
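Depending on your Hive version and distribution, transactional tables typically also require the transaction manager and concurrency support to be enabled. As a sketch (setting names are standard Hive configuration properties, but check your environment's defaults — managed services often pre-configure these):

```sql
-- Session/cluster settings commonly required for Hive ACID tables
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
-- Compaction services are usually enabled on the metastore/HiveServer2 side
SET hive.compactor.initiator.on=true;
SET hive.compactor.worker.threads=1;
```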

ORC Format

ORC, or Optimized Row Columnar, is a highly performant data format — among columnar formats, it is the most widely used after Parquet. You can denote a table to use the ORC format by specifying STORED AS ORC

Complete Working Example

-- Create Table
CREATE TABLE employees_s3a (empid int, empname string, empband int) STORED AS ORC
LOCATION 's3a://matrix/employeestxn'
TBLPROPERTIES ('transactional' = 'true');
-- Insert data into table
INSERT INTO employees_s3a values (898989,'AAAAA',8);
INSERT INTO employees_s3a values (767676,'BBBBB',7);
INSERT INTO employees_s3a values (545454,'CCCCC',6);
-- Select data from the table
SELECT * from employees_s3a;
+----------------------+------------------------+------------------------+
| employees_s3a.empid  | employees_s3a.empname  | employees_s3a.empband  |
+----------------------+------------------------+------------------------+
| 898989               | AAAAA                  | 8                      |
| 767676               | BBBBB                  | 7                      |
| 545454               | CCCCC                  | 6                      |
+----------------------+------------------------+------------------------+
3 rows selected (7.382 seconds)

-- DELETE data from the table
DELETE FROM employees_s3a WHERE empname='BBBBB';
-- UPDATE data in the table
UPDATE employees_s3a SET empband=10 WHERE empid=898989;
SELECT * from employees_s3a;
+----------------------+------------------------+------------------------+
| employees_s3a.empid  | employees_s3a.empname  | employees_s3a.empband  |
+----------------------+------------------------+------------------------+
| 545454               | CCCCC                  | 6                      |
| 898989               | AAAAA                  | 10                     |
+----------------------+------------------------+------------------------+
2 rows selected (9.065 seconds)

Wait! Hold on! Don’t we need an EXTERNAL qualifier for data on S3?

No! That’s the last caveat: this works only with managed Hive tables, which means you cannot use EXTERNAL tables. This is often confusing to users — it is not necessary to use an EXTERNAL table for data on S3 or COS. You just need to point the table at the object store through the LOCATION keyword, as shown above.

The difference between Managed tables and External tables is that for the former, Hive takes care of managing the underlying data along with the metadata. Whereas for the latter, Hive only takes care of the metadata and does not disturb the underlying data. For example, if you DROP a managed table, it will delete the data as well. Whereas if you DROP an external table, only the schema/metadata will be removed from Hive. The data still remains on the underlying system — be it in HDFS or S3.
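To illustrate the difference (the table name employees_ext and its location are hypothetical, just for contrast with the managed example above):

```sql
-- Managed table: DROP removes the metadata AND the ORC files under LOCATION
DROP TABLE employees_s3a;

-- External table: DROP removes only the Hive metadata;
-- the files at the LOCATION remain on S3/COS
CREATE EXTERNAL TABLE employees_ext (empid int, empname string, empband int)
STORED AS ORC
LOCATION 's3a://matrix/employees_ext';
DROP TABLE employees_ext;  -- data at s3a://matrix/employees_ext is untouched
```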

Try it out!

This is a simple concept — but the question has been asked far too many times not to warrant this short blog.


Mrudula Madiraju

Dealing with Data, Cloud, Compliance and sharing tidbits of epiphanies along the way.