As of September 10, 2017, Amazon has made major changes to how the raw IMDb dataset is retrieved and accessed. The dataset has been moved from the now-defunct public FTP servers to Amazon's proprietary cloud computing platform, Amazon Web Services (AWS).
The raw IMDb dataset is now accessible only through the AWS Simple Storage Service (S3). To download it, a user must register an AWS account and make a requester-pays request to the imdb-datasets bucket. The dataset is now updated daily rather than monthly as before. Users must also pay for transfers once the monthly 5 GB transfer limit is exceeded.
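As a rough illustration of what a requester-pays download might look like, here is a minimal Python sketch using the boto3 AWS SDK. Only the bucket name imdb-datasets comes from the text above. The object key is a hypothetical placeholder, the helper names build_request and download are made up for this example, and the code assumes boto3 is installed and AWS credentials are already configured.

```python
def build_request(key, bucket="imdb-datasets"):
    """Return keyword arguments for a requester-pays S3 GetObject call."""
    # RequestPayer="requester" acknowledges that the caller's AWS account
    # is billed for the transfer instead of the bucket owner.
    return {"Bucket": bucket, "Key": key, "RequestPayer": "requester"}


def download(key, destination):
    """Stream one object from the bucket to a local file."""
    import boto3  # third-party AWS SDK, imported lazily

    s3 = boto3.client("s3")
    response = s3.get_object(**build_request(key))
    with open(destination, "wb") as out:
        # Read the body in 1 MiB chunks so large files are not held in memory.
        for chunk in response["Body"].iter_chunks(chunk_size=1024 * 1024):
            out.write(chunk)


# Hypothetical usage; replace the key with a real object path in the bucket:
# download("path/to/some-file.tsv.gz", "some-file.tsv.gz")
```

Without the RequestPayer argument, S3 rejects requests to a requester-pays bucket, which is why the parameters are built in one place here.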
Amazon has also substantially changed the format of the data itself. Many data fields present in the original IMDb dataset are now either simplified or missing entirely. The data is admittedly easier to read, but at the cost of richness and depth of information.
What does this mean for the AgensGraph IMDb Import Project?
The AgensGraph import project relies heavily on the open source project IMDbPy to import relational data from the raw IMDb dataset before transforming that relational data into graph data. With the latest restructuring of the IMDb data format, a key script in the AgensGraph import process, imdbpy2sql.py, is no longer compatible with the IMDb dataset. This means we currently have no reliable way to import the newly formatted IMDb data into AgensGraph.
Davide Alberani, creator of IMDbPy, has said that the future of the IMDbPy project is uncertain in its current state. IMDbPy's parser would need to be modified to be compatible with the new data format.
A new graph schema would be necessary as well, because the graph schema of the database is derived from the relational schema and any change to one requires a change to the other. Like the simplified relational schema, the new graph schema would be simpler than the current one.
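To make the dependency between the two schemas concrete, here is a small Python sketch of a relational-to-graph transformation. It is not the actual AgensGraph import code or the IMDbPy schema. The table layouts (movies, people, cast), the column names, and the labels movie, person, and acted_in are all hypothetical and chosen only for illustration.

```python
def rows_to_graph(movies, people, cast):
    """Turn rows from three hypothetical relational tables into vertices and edges."""
    vertices = []
    # Each entity row becomes one vertex; non-key columns become properties.
    for m in movies:
        vertices.append({"label": "movie", "key": ("movie", m["id"]),
                         "properties": {"title": m["title"]}})
    for p in people:
        vertices.append({"label": "person", "key": ("person", p["id"]),
                         "properties": {"name": p["name"]}})

    edges = []
    # Each join-table row becomes one edge between the vertices it references.
    for c in cast:
        edges.append({"label": "acted_in",
                      "start": ("person", c["person_id"]),
                      "end": ("movie", c["movie_id"]),
                      "properties": {"role": c["role"]}})
    return vertices, edges


# Placeholder rows, not real IMDb data.
movies = [{"id": 1, "title": "Example Movie"}]
people = [{"id": 7, "name": "Example Actor"}]
cast = [{"person_id": 7, "movie_id": 1, "role": "lead"}]
vertices, edges = rows_to_graph(movies, people, cast)
```

Every column the function reads is tied to the relational layout, so a dropped or renamed field in the source data changes which vertex properties and edge labels can be produced.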
Changes to the raw IMDb dataset mean that the AgensGraph IMDb import will need to change too. With IMDbPy no longer compatible and the relational schema changed, the import needs substantial rework. We will be looking into alternative methods of importing movie data into AgensGraph and defining a new graph schema for that data, or possibly importing information from another database altogether.
Bitnine, a company specializing in graph databases