pyspark.sql.DataFrame.unionByName¶

DataFrame.unionByName(other: pyspark.sql.dataframe.DataFrame, allowMissingColumns: bool = False) → pyspark.sql.dataframe.DataFrame[source]¶

Returns a new DataFrame containing union of rows in this and another DataFrame.

This is different from both UNION ALL and UNION DISTINCT in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by distinct().

New in version 2.3.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters

otherDataFrame: Another DataFrame that needs to be combined.
allowMissingColumnsbool, optional, default False: Specify whether to allow missing columns.

New in version 3.1.0.

Returns

DataFrame: Combined DataFrame.

Examples

The difference between this function and union() is that this function resolves columns by name (not by position):

>>>>>> df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"])
>>> df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col0"])
>>> df1.unionByName(df2).show()
+----+----+----+
|col0|col1|col2|
+----+----+----+
|   1|   2|   3|
|   6|   4|   5|
+----+----+----+

When the parameter allowMissingColumns is True, the set of column names in this and other DataFrame can differ; missing columns will be filled with null. Further, the missing columns of this DataFrame will be added at the end in the schema of the union result:

>>>>>> df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"])
>>> df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col3"])
>>> df1.unionByName(df2, allowMissingColumns=True).show()
+----+----+----+----+
|col0|col1|col2|col3|
+----+----+----+----+
|   1|   2|   3|null|
|null|   4|   5|   6|
+----+----+----+----+

pyspark.sql.DataFrame.unionAll pyspark.sql.DataFrame.unpersist