
List All Files In HDFS With Python Without Pydoop

I have a Hadoop cluster running on CentOS 6.5. I am currently using Python 2.6, and for unrelated reasons I can't upgrade to Python 2.7. Due to this unfortunate fact, I cannot install Pydoop. How can I list all the files in an HDFS directory without it?

Solution 1:

If Pydoop doesn't work, you can try the Snakebite library, which should work with Python 2.6. Another option is enabling the WebHDFS REST API and calling it directly with requests (or any other HTTP client).

import requests
print requests.get("http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=LISTSTATUS").json()
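
The LISTSTATUS call returns a JSON document with a FileStatuses.FileStatus array; here is a minimal sketch of pulling the entry names out of it (the namenode address and path below are placeholders for your cluster):

import requests

# WebHDFS LISTSTATUS response shape: {"FileStatuses": {"FileStatus": [{...}, ...]}}
response = requests.get("http://namenode:50070/webhdfs/v1/user/data?op=LISTSTATUS")
for status in response.json()["FileStatuses"]["FileStatus"]:
    # 'pathSuffix' is the entry name, 'type' is FILE or DIRECTORY
    print status["pathSuffix"], status["type"]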

With Snakebite:

from snakebite.client import Client

# Connect to the NameNode's RPC port (8020 by default) and list the root directory
client = Client("localhost", 8020, use_trash=False)
for entry in client.ls(['/']):
    print entry
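
If you only want files (and want to descend into subdirectories), note that ls yields plain dicts; a sketch assuming the 'path' and 'file_type' fields documented by Snakebite:

from snakebite.client import Client

client = Client("localhost", 8020, use_trash=False)
# recurse=True walks subdirectories; 'file_type' is 'f' for files, 'd' for directories
files = [e['path'] for e in client.ls(['/'], recurse=True) if e['file_type'] == 'f']
print files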

Solution 2:

I would suggest checking out hdfs3.

>>> from hdfs3 import HDFileSystem
>>> hdfs = HDFileSystem(host='localhost', port=8020)
>>> hdfs.ls('/user/data')
>>> hdfs.put('local-file.txt', '/user/data/remote-file.txt')
>>> hdfs.cp('/user/data/file.txt', '/user2/data')

Like Snakebite, hdfs3 uses protobufs for communication and bypasses the JVM. Unlike Snakebite, hdfs3 offers Kerberos support.

Solution 3:

I would recommend this Python project: https://github.com/mtth/hdfs. It talks to HDFS over HttpFS and it's actually quite simple and fast. I've been using it on my cluster with Kerberos and it works like a charm. You just need to set the namenode or HttpFS service URL.
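
For completeness, a minimal sketch of listing a directory with that library; the URL, user, and path are placeholders, and for a Kerberized cluster you would use its KerberosClient extension instead of InsecureClient:

from hdfs import InsecureClient

# Point the client at the namenode's WebHDFS endpoint or an HttpFS service
client = InsecureClient('http://namenode:50070', user='hdfs')
# list() returns the names of the entries directly under the directory
for name in client.list('/user/data'):
    print name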
