How do I read a gzip-compressed JSON lines file into a PySpark dataframe?

I have a JSON lines file that I want to read into a PySpark dataframe. The file is gzip-compressed.

The filename looks like this: file.jl.gz

I know how to read this file into a pandas dataframe:

import pandas as pd

df = pd.read_json('file.jl.gz', lines=True, compression='gzip')

I'm new to pyspark, and I'd like to learn the pyspark equivalent of this. Is there a way to read this file into a pyspark dataframe?

EDIT 2

%pyspark
df=spark.read.option('multiline','true').json("s3n:AccessKey:secretkey@bucketname/ds_dump_00000.jl.gz")

I ran the command above and got this error:

Fail to execute line 1: 
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o130.json.
: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: 
    at org.apache.hadoop.fs.Path.initialize(Path.java:259)
    at org.apache.hadoop.fs.Path.<init>(Path.java:217)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:560)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:559)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
    at scala.collection.immutable.List.flatMap(List.scala:355)
    at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:559)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:242)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:230)
    at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:411)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: 
    at java.net.URI.checkPath(URI.java:1823)
    at java.net.URI.<init>(URI.java:745)
    at org.apache.hadoop.fs.Path.initialize(Path.java:256)
    ... 24 more


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-8814422034403105951.py", line 380, in <module>
    exec(code, _zcUserQueryNameSpace)
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 274, in json
    return self._df(self._jreader.json(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 79, in deco
    raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: 'java.net.URISyntaxException: Relative path in absolute URI:'
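
Note: the exception points at the URI itself. A Hadoop-style S3 URI needs two slashes after the scheme (s3n:// or, on current connectors, s3a://); writing s3n: without them leaves no authority component, so the path is parsed as a relative path inside an absolute URI, which is exactly this error. A minimal sketch of the corrected call, keeping AccessKey, secretkey, and bucketname as placeholders from the question:

%pyspark
# The missing '//' after the scheme is what triggered the URISyntaxException.
# Embedding credentials in the URI works with s3n, but on s3a they are better
# supplied via Hadoop configuration (fs.s3a.access.key / fs.s3a.secret.key).
df = spark.read.json("s3n://AccessKey:secretkey@bucketname/ds_dump_00000.jl.gz")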

person sachin kumar s    schedule 27.12.2020
comment
reading the documentation is not an option for you, why? spark.apache.org/docs/latest/sql-data-sources-json.html - there is no magic with compressed files, they are simply decompressed based on the file extension.   -  person UninformedUser    schedule 27.12.2020


Answers (1)


# JSON Lines is Spark's default JSON format (one object per line), so the
# multiline option is not needed here; the gzip file is decompressed
# automatically based on its .gz extension.
df = spark.read.json('file.jl.gz')

It is always better to provide a schema when reading complex JSON.
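
A minimal sketch of how a schema could be passed, assuming hypothetical field names (id and name stand in for whatever fields your JSON lines actually contain):

from pyspark.sql.types import StructType, StructField, LongType, StringType

# Hypothetical schema: replace these fields with the ones in your data.
schema = StructType([
    StructField('id', LongType(), True),
    StructField('name', StringType(), True),
])

# Supplying the schema skips Spark's schema-inference pass over the file,
# which also avoids a full extra read of the compressed data.
df = spark.read.schema(schema).json('file.jl.gz')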

person Aditya Vikram Singh    schedule 27.12.2020