Realtime Processing With Apache Spark

Name: _____________________

Date: _____________________

Instructions: Answer all questions. Write your answers clearly in the space provided.

Question 1:

Point out the wrong statement in each group of options.

A. The major difference between Hadoop and Hama is map/reduce tasks can't communicate with each other
B. Hama follows master/slave pattern
C. A JobTracker maps to a BSPMaster, TaskTracker maps to a GroomServer and Map/Reduce task maps to a BSPTask
D. All of the mentioned

A. ConcurScheduler detects whether the index is on SSD or not
B. Memory index supports payloads
C. Auto-IO-throttling has been added to ConcurrentMergeScheduler, to rate limit IO writes for each merge depending on incoming merge rate
D. The default codec has an option to control BEST_SPEED or BEST_COMPRESSION for stored fields

A. Spark is intended to replace the Hadoop stack
B. Spark was designed to read and write data from and to HDFS, as well as other storage systems
C. Hadoop users who have already deployed or are planning to deploy Hadoop Yarn can simply run Spark on YARN
D. None of the mentioned

A. Hadoop is a prerequisite for Drill
B. Drill tackles rapidly evolving application driven schemas and nested data structures
C. Drill provides a single interface for structured and semi-structured data allowing you to readily query JSON files and HBase tables as easily as a relational table
D. All of the mentioned

A. 'Taste' collaborative-filtering recommender component of Mahout was originally a separate project and can run standalone without Hadoop
B. Integration of Mahout with initiatives such as the Pregel-like Giraph are actively under discussion
C. Calculating the LLR is very straightforward
D. None of the mentioned

A. Version 1.4.0 is the fourth Flume release as an Apache top-level project
B. Apache Flume 1.5.2 is a security and maintenance release that disables SSLv3 on all components in Flume that support SSL/TLS
C. Flume is backwards-compatible with previous versions of the Flume 1.x codeline
D. None of the mentioned

A. A Crunch pipeline written by the development team sessionizes a set of user logs, which are then processed by a diverse collection of Pig scripts and Hive queries
B. Crunch pipelines provide a thin veneer on top of MapReduce
C. Developers have access to low-level MapReduce APIs
D. None of the mentioned

A. PyLucene is a Lucene port
B. PyLucene embeds a Java VM with Lucene into a Python process
C. The PyLucene Python extension, a Python module called lucene, is machine-generated by JCC
D. PyLucene is built with JCC

A. For distributed storage, Spark can interface with a wide variety of systems, including Hadoop Distributed File System (HDFS)
B. Spark also supports a pseudo-distributed mode, usually used only for development or testing purposes
C. Spark has over 465 contributors in 2014
D. All of the mentioned

A. The Hive metastore lets you create tables without specifying a database
B. Restrictions apply to the types of columns HCatLoader can read from HCatalog-managed tables
C. If the table is partitioned, you can indicate which partitions to scan by immediately following the load statement with a partition filter statement
D. None of the mentioned

A. Apache Hama is not a pure Bulk Synchronous Parallel Engine
B. Hama uses the Hadoop Core for RPC calls
C. Apache Hama is optimized for massive scientific computations such as matrix, graph and network algorithms
D. Hama is a relatively newer project than Hadoop

A. With Thrift, it is not possible to define a service and change the protocol and transport without recompiling the code
B. Thrift includes server infrastructure to tie protocols and transports together, like blocking, non-blocking, and multi threaded servers
C. Thrift supports a number of protocols for service definition
D. None of the mentioned

A. The original name of WebHCat was Templeton
B. Robert in client management uses Hive to analyze his clients' results
C. With HCatalog, HCatalog cannot send a JMS message that data is available
D. All of the mentioned

A. HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools
B. There is a Hive-specific interface for HCatalog
C. Data is defined using HCatalog's command line interface (CLI)
D. All of the mentioned

A. DoFns also have a number of helper methods for working with Hadoop Counters, all named increment
B. The Crunch APIs contain a number of useful subclasses of DoFn that handle common data processing scenarios and are easier to write and test
C. FilterFn class defines a single abstract method
D. None of the mentioned

A. There are no XML configuration files in Thrift
B. Thrift gives cross-language serialization with lower overhead than alternatives such as SOAP due to use of binary format
C. No framework to code is a feature of Thrift
D. None of the mentioned
Answer: _________
Question 2:

. . . . . . . . generates NGrams and counts frequencies for ngrams, head and tail subgrams.

A. CollocationDriver
B. CollocDriver
C. CarDriver
D. All of the mentioned
Answer: _________
Question 3:

Which of the following Hive commands is not supported by HCatalog?

A. ALTER INDEX ... REBUILD
B. CREATE VIEW
C. SHOW FUNCTIONS
D. DROP TABLE
Answer: _________
Question 4:

A float parameter, defaults to 0.0001f, which means we can deal with 1 error every . . . . . . . . rows.

A. 1000
B. 10000
C. 1 million
D. None of the mentioned
Answer: _________
Question 5:

. . . . . . . . Collection API allows for even distribution of custom replica properties.

A. BALANUNIQUE
B. BALANCESHARDUNIQUE
C. BALANCEUNIQUE
D. None of the mentioned
Answer: _________
Question 6:

Spark runs on top of . . . . . . . . a cluster manager system which provides efficient resource isolation across distributed applications.

A. Mesjs
B. Mesos
C. Mesus
D. All of the mentioned
Answer: _________
Question 7:

Which of the following can be used to launch Spark jobs inside MapReduce?

A. SIM
B. SIMR
C. SIR
D. RIS
Answer: _________
Question 8:

. . . . . . . . transport is required when using a non-blocking server.

A. TZlibTransport
B. TFramedTransport
C. TMemoryTransport
D. None of the mentioned
Answer: _________
Question 9:

Drill also provides intuitive extensions to SQL to work with . . . . . . . . data types.

A. simple
B. nested
C. int
D. all of the mentioned
Answer: _________
Question 10:

. . . . . . . . sink can be a text file, the console display, a simple HDFS path, or a null bucket where the data is simply deleted.

A. Collector Tier Event
B. Agent Tier Event
C. Basic
D. None of the mentioned
Answer: _________
Question 11:

SolrJ now has first class support for . . . . . . . . API.

A. Compactions
B. Collections
C. Distribution
D. All of the mentioned
Answer: _________
Question 12:

The tokens are passed through a Lucene . . . . . . . . to produce NGrams of the desired length.

A. ShngleFil
B. ShingleFilter
C. SingleFilter
D. Collfilter
Answer: _________
Question 13:

Point out the correct statement in each group of options.

A. Drill provides plug-and-play integration with existing Apache Hive
B. Developers can use the sandbox environment to get a feel for the power and capabilities of Apache Drill by performing various types of queries
C. Drill is inspired by Google Dremel
D. None of the mentioned

A. Mahout is distributed under a commercially friendly Apache Software license
B. Mahout is a library of scalable machine-learning algorithms, implemented on top of Apache Hadoop® and using the MapReduce paradigm
C. Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms
D. None of the mentioned

A. Flume is a distributed, reliable, and available service
B. Version 1.5.2 is the eighth Flume release as an Apache top-level project
C. Flume 1.5.2 is production-ready software for integration with Hadoop
D. All of the mentioned

A. Every Lucene segment now stores a unique id per-segment and per-commit to aid in accurate replication of index files
B. The default norms format now uses sparse encoding when appropriate
C. Tokenizers and Analyzers no longer require Reader on init
D. All of the mentioned

A. Scrunch's Java API is centered around three interfaces that represent distributed datasets
B. All of the other data transformation operations supported by the Crunch APIs are implemented in terms of three primitives
C. A number of common Aggregator<V> implementations are provided in the Aggregators class
D. All of the mentioned

A. Apache Hama is a distributed computing framework based on Bulk Synchronous Parallel computing techniques for massive scientific computations
B. Hama is a Top Level Project under the Apache Software Foundation
C. BSP stands for Bulk Synchronous Parallel
D. All of the mentioned

A. StreamPipeline executes the pipeline in-memory on the client
B. MemPipeline executes the pipeline by converting it to a series of Spark pipelines
C. MapReduce framework approach makes it easy for the framework to serialize data from the client to the cluster
D. All of the mentioned

A. Building PyLucene requires GNU Make, a recent version of Ant capable of building Java Lucene and a C++ compiler
B. PyLucene is supported on Mac OS X, Linux, Solaris and Windows
C. Use of setuptools is recommended for Lucene
D. All of the mentioned

A. HCat provides connectors for MapReduce
B. Apache HCatalog provides table data access for CDH components such as Pig and MapReduce
C. HCat makes Hive metadata available to users of other Hadoop tools like Pig, MapReduce and Hive
D. All of the mentioned

A. Spark enables Apache Hive users to run their unmodified queries much faster
B. Spark interoperates only with Hadoop
C. Spark is a popular data warehouse solution running on top of Hadoop
D. None of the mentioned

A. In local mode, nothing must be launched via the start scripts
B. Distributed Mode is just like the "Pseudo Distributed Mode"
C. Apache Hama is one of the under-hyped projects in the Hadoop ecosystem
D. All of the mentioned

A. There is no guaranteed read consistency when a partition is dropped
B. Unpartitioned tables effectively have one default partition that must be created at table creation time
C. Once a partition is created, records cannot be added to it, removed from it, or updated in it
D. All of the mentioned

A. Thrift is developed for scalable cross-language services development
B. Thrift includes a complete stack for creating clients and servers
C. The top part of the Thrift stack is generated code from the Thrift definition
D. All of the mentioned

A. To create a Mahout service, one has to write Thrift files that describe it, generate the code in the destination language
B. Thrift is written in Java
C. Thrift is a lean and clean library
D. None of the mentioned

A. The HCatLoader and HCatStorer interfaces are used with Pig scripts to read and write data in HCatalog-managed tables
B. HCatalog is not thread safe
C. HCatLoader is used with Pig scripts to read data from HCatalog-managed tables.
D. All of the mentioned

A. RSS abstraction provides distributed task dispatching, scheduling, and basic I/O functionalities
B. For cluster manager, Spark supports standalone Hadoop YARN
C. Hive SQL is a component on top of Spark Core
D. None of the mentioned
Answer: _________
Question 14:

. . . . . . . . phase merges the counts for unique ngrams or ngram fragments across multiple documents.

A. CollocCombiner
B. CollocReducer
C. CollocMerger
D. None of the mentioned
Answer: _________
Question 15:

Spark was initially started by . . . . . . . . at UC Berkeley AMPLab in 2009.

A. Mahek Zaharia
B. Matei Zaharia
C. Doug Cutting
D. Stonebraker
Answer: _________
Question 16:

Which of the following is a straightforward binary format?

A. TCompactProtocol
B. TDenseProtocol
C. TBinaryProtocol
D. TSimpleJSONProtocol
Answer: _________
Question 17:

Lucene index size is roughly . . . . . . . . the size of text indexed.

A. 10%
B. 20%
C. 50%
D. 70%
Answer: _________
Question 18:

Which of the following projects is an interface definition language for Hadoop?

A. Oozie
B. Mahout
C. Thrift
D. Impala
Answer: _________
Question 19:

Which of the following Apache projects is steadily gaining a lot of traction through the efforts of its committers?

A. Hama
B. Hadoop
C. Hive
D. Pig
Answer: _________
Question 20:

Apache Hama provides complete clone of . . . . . . . .

A. Pragmatic
B. Pregel
C. ServePreg
D. All of the mentioned
Answer: _________
Question 21:

HCatalog maintains a cache of . . . . . . . . to talk to the metastore.

A. HiveServer
B. HiveClients
C. HCatClients
D. All of the mentioned
Answer: _________
Question 22:

. . . . . . . . is a human-readable text format to aid in debugging.

A. TMemory
B. TDebugProtocol
C. TBinaryProtocol
D. TSimpleJSONProtocol
Answer: _________
Question 23:

. . . . . . . . executes the pipeline as a series of MapReduce jobs.

A. SparkPipeline
B. MRPipeline
C. MemPipeline
D. None of the mentioned
Answer: _________
Question 24:

Lucene provides scalable, high-performance indexing over . . . . . . . . per hour on modern hardware.

A. 1 TB
B. 150GB
C. 10 GB
D. None of the mentioned
Answer: _________
Question 25:

. . . . . . . . is responsible for maintaining groom server status.

A. GroomServers
B. BSPMaster
C. Zookeeper
D. All of the mentioned
Answer: _________
Question 26:

Hama requires JRE . . . . . . . . or higher and ssh to be set up between nodes in the cluster.

A. 1.6
B. 1.7
C. 1.8
D. 2.0
Answer: _________
Question 27:

Mahout provides . . . . . . . . libraries for common and primitive Java collections.

A. Java
B. Javascript
C. Perl
D. Python
Answer: _________
Question 28:

. . . . . . . . is a high performance search server built using Lucene Core.

A. Solr
B. Lucene Core
C. Lucy
D. PyLucene
Answer: _________
Question 29:

. . . . . . . . is a multi-threaded server using standard blocking I/O.

A. TNonblockingServer
B. TThreadPoolServer
C. TSimpleServer
D. None of the mentioned
Answer: _________
Question 30:

. . . . . . . . transport writes to a file.

A. TNonblockingServer
B. TFileTransport
C. TFramedTransport
D. TMemoryTransport
Answer: _________
Question 31:

. . . . . . . . leverages Spark Core fast scheduling capability to perform streaming analytics.

A. MLlib
B. Spark Streaming
C. GraphX
D. RDDs
Answer: _________
Question 32:

Crunch was designed for developers who understand . . . . . . . . and want to use MapReduce effectively.

A. Java
B. Python
C. Scala
D. Javascript
Answer: _________
Question 33:

. . . . . . . . is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

A. Lucene
B. Oozie
C. Lucy
D. All of the mentioned
Answer: _________
Question 34:

. . . . . . . . is a Python port of the Core project.

A. Solr
B. Lucene Core
C. Lucy
D. PyLucene
Answer: _________
Question 35:

Drill analyzes semi-structured/nested data coming from . . . . . . . . applications.

A. RDBMS
B. NoSQL
C. NewSQL
D. None of the mentioned
Answer: _________
Question 36:

What is the role of the Oozie Coordinator EL functions in a coordinator workflow?

A. A distributed file system
B. A query language for Hadoop
C. Provide functions for evaluating expressions and conditions
D. A storage format in Hadoop
Answer: _________
Question 37:

In Oozie, what is the significance of the "precondition" element in a workflow action?

A. Sort records based on a column
B. Group records based on a condition
C. Join records from multiple tables
D. Define conditions that must be met before executing the action
Answer: _________
Question 38:

What is the purpose of the Oozie Sharelib in workflow automation?

A. Share and reuse libraries and resources across workflows
B. Execute MapReduce jobs
C. Manage computation resources
D. Perform data analytics in Hadoop
Answer: _________
Question 39:

In Oozie, what is the function of the "sla:info" property in a workflow action?

A. A distributed file system
B. A query language for Hadoop
C. A storage format in Hadoop
D. Specify Service Level Agreement (SLA) information for the action
Answer: _________
Question 40:

What is the significance of the Oozie Bundle Application in workflow automation?

A. Sort records based on a column
B. Group records based on a condition
C. Group and manage multiple workflows as a single unit
D. Join records from multiple tables
Answer: _________
Question 41:

In Oozie, what is the purpose of the "retry-max" property in a workflow action?

A. Manage computation resources
B. Specify the maximum number of retries for an action
C. Perform data analytics in Hadoop
D. None of the above
Answer: _________
Question 42:

What is the significance of the Oozie Kill node in a workflow?

A. A distributed file system
B. A query language for Hadoop
C. Terminate the workflow if a specific condition is met
D. A storage format in Hadoop
Answer: _________
Question 43:

In Oozie, what is the role of the "transition" element in a decision node?

A. Sort records based on a column
B. Group records based on a condition
C. Join records from multiple tables
D. Define the transitions based on conditions for decision node outcomes
Answer: _________
Question 44:

What is the purpose of the Oozie Sharelib Update tool?

A. Update and distribute Oozie Sharelib libraries
B. Execute MapReduce jobs
C. Manage computation resources
D. Perform data analytics in Hadoop
Answer: _________
Question 45:

In Oozie, what is the significance of the "ok to" and "error to" properties in a workflow?

A. A distributed file system
B. A query language for Hadoop
C. A storage format in Hadoop
D. Specify the next nodes to be executed based on the outcome of an action
Answer: _________
Question 46:

Spark is engineered from the bottom-up for performance, running . . . . . . . . faster than Hadoop by exploiting in-memory computing and other optimizations.

A. 100x
B. 150x
C. 200x
D. None of the mentioned
Answer: _________
Question 47:

For . . . . . . . . partitioning jobs, simply specifying a custom directory is not good enough.

A. static
B. semi cluster
C. dynamic
D. all of the mentioned
Answer: _________
Question 48:

All file access uses Java's . . . . . . . . APIs which give Lucene stronger index safety.

A. NIO.2
B. NIO.3
C. NIO.4
D. NIO.5
Answer: _________
Question 49:

. . . . . . . . includes Apache Drill as part of the Hadoop distribution.

A. Impala
B. MapR
C. Oozie
D. All of the mentioned
Answer: _________
Question 50:

Hama was inspired by Google's . . . . . . . . large-scale graph computing framework.

A. Pragmatic
B. Pregel
C. Preghad
D. All of the mentioned
Answer: _________
Question 51:

A . . . . . . . . represents a distributed, immutable collection of elements of type T.

A. PCollect<T>
B. PCollection<T>
C. PCol<T>
D. All of the mentioned
Answer: _________
Question 52:

MapR . . . . . . . . Solution Earns Highest Score in Gigaom Research Data Warehouse Interoperability Report.

A. SQL-on-Hadoop
B. Hive-on-Hadoop
C. Pig-on-Hadoop
D. All of the mentioned
Answer: _________
Question 53:

Crunch uses Java serialization to serialize the contents of all of the . . . . . . . . in a pipeline definition.

A. Transient
B. DoFns
C. Configuration
D. All of the mentioned
Answer: _________
Question 54:

For Scala users, there is the . . . . . . . . API, which is built on top of the Java APIs.

A. Prunch
B. Scrunch
C. Hivench
D. All of the mentioned
Answer: _________
Question 55:

. . . . . . . . property allows us to specify a custom dir location pattern for all the writes, and will interpolate each variable.

A. hcat.dynamic.partitioning.custom.pattern
B. hcat.append.limit
C. hcat.pig.storer.external.location
D. hcatalog.hive.client.cache.expiry.time
Answer: _________
Question 56:

Spark architecture is . . . . . . . . times as fast as Hadoop disk-based Apache Mahout and even scales better than Vowpal Wabbit.

A. 10
B. 20
C. 50
D. 100
Answer: _________
Question 57:

Sally in data processing uses . . . . . . . . to cleanse and prepare the data.

A. Pig
B. Hive
C. HCatalog
D. Impala
Answer: _________
Question 58:

Groom servers start up with a . . . . . . . . instance and an RPC proxy to contact the BSP master.

A. RPC
B. BSPPeer
C. LPC
D. None of the mentioned
Answer: _________
Question 59:

. . . . . . . . is used with Pig scripts to write data to HCatalog-managed tables.

A. HamaStorer
B. HCatStam
C. HCatStorer
D. All of the mentioned
Answer: _________
Question 60:

Mahout provides an implementation of a . . . . . . . . identification algorithm which scores collocations using log-likelihood ratio.

A. collocation
B. compaction
C. collection
D. none of the mentioned
Answer: _________
Question 61:

Hive version . . . . . . . . is the first release that includes HCatalog.

A. 0.10.0
B. 0.11.0
C. 0.12.0
D. All of the mentioned
Answer: _________
Question 62:

The Apache Crunch Java library provides a framework for writing, testing, and running . . . . . . . . pipelines.

A. MapReduce
B. Pig
C. Hive
D. None of the mentioned
Answer: _________
Question 63:

How many types of modes are present in Hama?

A. 2
B. 3
C. 4
D. 5
Answer: _________
Question 64:

. . . . . . . . provides Java-based indexing and search technology.

A. Solr
B. Lucene Core
C. Lucy
D. All of the mentioned
Answer: _________
Question 65:

The first call on the HCatOutputFormat must be . . . . . . . .

A. setOutputSchema
B. setOutput
C. setOut
D. OutputSchema
Answer: _________
Question 66:

. . . . . . . . accepts a table to read data from and optionally a selection predicate to indicate which partitions to scan.

A. HCatOutputFormat
B. HCatInputFormat
C. OutputFormat
D. InputFormat
Answer: _________
Question 67:

The top-level . . . . . . . . package contains three of the most important specializations in Crunch.

A. org.apache.scrunch
B. org.apache.crunch
C. org.apache.kcrunch
D. all of the mentioned
Answer: _________
Question 68:

. . . . . . . . represent the logical computations of your Crunch pipelines.

A. DoFns
B. DoFn
C. ThreeFns
D. None of the mentioned
Answer: _________
Question 69:

The web UI provides information about . . . . . . . . job statistics of the Hama cluster.

A. MPP
B. BSP
C. USP
D. ISP
Answer: _________
Question 70:

. . . . . . . . is used when you want the sink to be the input source for another operation.

A. Collector Tier Event
B. Agent Tier Event
C. Basic
D. All of the mentioned
Answer: _________
Question 71:

In Distributed Mode, the machines are mapped in the . . . . . . . . file.

A. groomservers
B. grervers
C. grsvers
D. groom
Answer: _________
Question 72:

Apache Flume 1.3.0 is the fourth release under the auspices of Apache of the so-called . . . . . . . . codeline.

A. NG
B. ND
C. NF
D. NR
Answer: _________
Question 73:

A . . . . . . . . is an operation on the stream that can transform the stream.

A. Decorator
B. Source
C. Sinks
D. All of the mentioned
Answer: _________
Question 74:

. . . . . . . . uses memory for I/O in Thrift.

A. TZlibTransport
B. TFramedTransport
C. TMemoryTransport
D. None of the mentioned
Answer: _________
Question 75:

HCatalog supports reading and writing files in any format for which a . . . . . . . . can be written.

A. SerDe
B. SaerDear
C. DocSear
D. All of the mentioned
Answer: _________
Question 76:

New . . . . . . . . type enables indexing and searching of date ranges, particularly multi-valued ones.

A. RangeField
B. DateField
C. DateRangeField
D. All of the mentioned
Answer: _________
Question 77:

Spark is packaged with higher level libraries, including support for . . . . . . . . queries.

A. SQL
B. C
C. C++
D. None of the mentioned
Answer: _________
Question 78:

What is the primary role of the Oozie Coordinator Action?

A. Sort records based on a column
B. Group records based on a condition
C. Represent an individual action within a coordinator workflow
D. Join records from multiple tables
Answer: _________
Question 79:

In Oozie, what does the term "data-in" represent in a coordinator workflow?

A. Manage computation resources
B. Specify the input data for a coordinator workflow
C. Perform data analytics in Hadoop
D. None of the above
Answer: _________
Question 80:

What is the significance of the Oozie Subworkflow in workflow automation?

A. A distributed file system
B. A query language for Hadoop
C. Embed a subworkflow within a main workflow for modular design
D. A storage format in Hadoop
Answer: _________
Question 81:

In Oozie, what is the purpose of the "start" and "end" nodes in a workflow?

A. Sort records based on a column
B. Group records based on a condition
C. Join records from multiple tables
D. Indicate the beginning and end of a workflow
Answer: _________
Question 82:

What is the role of the Oozie SLA Alert in workflow automation?

A. Send alerts based on Service Level Agreement (SLA) violations
B. A storage format in Hadoop
C. A query language for Hadoop
D. A distributed file system
Answer: _________
Question 83:

In Oozie, what is the primary function of the Oozie Bundle Application Coordinator?

A. Perform data analytics in Hadoop
B. Manage computation resources
C. Execute MapReduce jobs
D. Define and manage a bundle of coordinated workflows
Answer: _________
Question 84:

What is the significance of the Oozie Fork and Join nodes in a workflow?

A. Sort records based on a column
B. Group records based on a condition
C. Parallelize and synchronize the execution of multiple actions
D. Join records from multiple tables
Answer: _________
Question 85:

In Oozie, what does the term "end" represent in a workflow action?

A. Manage computation resources
B. Specify the final state or outcome of an action
C. Perform data analytics in Hadoop
D. None of the above
Answer: _________
Question 86:

What is the purpose of the Oozie Sharelib Create tool?

A. A distributed file system
B. A query language for Hadoop
C. Create and distribute Oozie Sharelib libraries
D. A storage format in Hadoop
Answer: _________
Question 87:

In Oozie, what is the role of the "timeout" property in a workflow action?

A. Sort records based on a column
B. Group records based on a condition
C. Join records from multiple tables
D. Define the maximum execution time for an action
Answer: _________
Question 88:

What is Oozie in the context of Hadoop?

A. A workflow scheduler for Hadoop jobs
B. A distributed storage system
C. A query language for Hadoop
D. A data processing engine for Hadoop
Answer: _________
Question 89:

In Oozie, what is the primary purpose of a coordinator?

A. Perform data analytics in Hadoop
B. Execute MapReduce jobs
C. Manage computation resources
D. Define and manage recurrent workflows
Answer: _________
Question 90:

What is an Oozie Bundle in the context of workflow automation?

A. A storage format in Hadoop
B. A query language for Hadoop
C. A collection of coordinated workflows
D. A distributed file system
Answer: _________
Question 91:

In Oozie, what does the term "action" refer to?

A. A query language for Hadoop
B. A unit of work in a workflow
C. A distributed file system
D. None of the above
Answer: _________
Question 92:

What is the purpose of the Oozie Coordinator Dataset?

A. Join records from multiple tables
B. Group records based on a condition
C. Define the data set for a coordinator
D. Sort records based on a column
Answer: _________
Question 93:

In Oozie, what is the role of the Oozie Workflow XML file?

A. A storage format in Hadoop
B. A query language for Hadoop
C. A distributed file system
D. Define the structure and execution flow of a workflow
Answer: _________
Question 94:

What is the significance of the Oozie Decision node in a workflow?

A. Conditionally control the flow of the workflow
B. Execute MapReduce jobs
C. Manage computation resources
D. Perform data analytics in Hadoop
Answer: _________
Question 95:

In Oozie, what does the term "bundle application coordinator" mean?

A. A distributed file system
B. A query language for Hadoop
C. A storage format in Hadoop
D. A coordinator for managing a bundle of workflows
Answer: _________
Question 96:

What is the primary role of the Oozie Action Executor?

A. Sort records based on a column
B. Group records based on a condition
C. Execute individual actions in a workflow
D. Join records from multiple tables
Answer: _________
Question 97:

In Oozie, what is the purpose of the "capture-output" property in a workflow action?

A. Manage computation resources
B. Capture the standard output of an action
C. Perform data analytics in Hadoop
D. None of the above
Answer: _________
Question 98:

. . . . . . . . is a REST API for HCatalog.

A. WebHCat
B. WbHCat
C. InpHCat
D. None of the mentioned
Answer: _________
Question 99:

A . . . . . . . . server and a data node should be run on one physical node.

A. groom
B. web
C. client
D. all of the mentioned
Answer: _________
Question 100:

Which of the following is a multi-threaded server using non-blocking I/O?

A. TNonblockingServer
B. TSimpleServer
C. TSocket
D. None of the mentioned
Answer: _________
Question 101:

Users can easily run Spark on top of Amazon's . . . . . . . .

A. Infosphere
B. EC2
C. EMR
D. None of the mentioned
Answer: _________
Question 102:

A . . . . . . . . is used to manage the efficient barrier synchronization of the BSPPeers.

A. GroomServers
B. BSPMaster
C. Zookeeper
D. None of the mentioned
Answer: _________
Question 103:

. . . . . . . . uses blocking socket I/O for transport.

A. TNonblockingServer
B. TSimpleServer
C. TSocket
D. None of the mentioned
Answer: _________
Question 104:

The . . . . . . . . class defines a configuration parameter named LINES_PER_MAP that controls how the input file is split.

A. NLineInputFormat
B. InputLineFormat
C. LineInputFormat
D. None of the mentioned
Answer: _________
Question 105:

Which of the following uses JSON for encoding of data?

A. TCompactProtocol
B. TDenseProtocol
C. TBinaryProtocol
D. None of the mentioned
Answer: _________
Question 106:

Which of the following performs compression using zlib?

A. TZlibTransport
B. TFramedTransport
C. TMemoryTransport
D. None of the mentioned
Answer: _________
Question 107:

Inline DoFn that splits a line up into words is an inner class . . . . . . . .

A. Pipeline
B. MyPipeline
C. ReadPipeline
D. WritePipe
Answer: _________
Question 108:

The HCatalog interface for Pig consists of . . . . . . . . and HCatStorer, which implement the Pig load and store interfaces respectively.

A. HCLoader
B. HCatLoader
C. HCatLoad
D. None of the mentioned
Answer: _________
Question 109:

. . . . . . . . is a subproject with the aim of collecting and distributing free materials.

A. OSR
B. OPR
C. ORP
D. ORS
Answer: _________
Question 110:

. . . . . . . . does not restrict contributions to Hadoop based implementations.

A. Mahout
B. Oozie
C. Impala
D. All of the mentioned
Answer: _________
Question 111:

A number of . . . . . . . . source adapters give you the granular control to grab a specific file.

A. multimedia file
B. text file
C. image file
D. none of the mentioned
Answer: _________
Question 112:

The Lucene . . . . . . . . is pleased to announce the availability of Apache Lucene 5.0.0 and Apache Solr 5.0.0.

A. PMC
B. RPC
C. CPM
D. All of the mentioned
Answer: _________
Question 113:

The output descriptor for the table to be written is created by calling . . . . . . . .

A. OutputJobInfo.describe
B. OutputJobInfo.create
C. OutputJobInfo.put
D. None of the mentioned
Answer: _________
Question 114:

. . . . . . . . is a component on top of Spark Core.

A. Spark Streaming
B. Spark SQL
C. RDDs
D. All of the mentioned
Answer: _________
Question 115:

Heap usage during IndexWriter merging is also much lower with the new . . . . . . . .

A. LucCodec
B. Lucene50Codec
C. Lucene20Cod
D. All of the mentioned
Answer: _________
Question 116:

Drill is designed from the ground up to support high-performance analysis on the . . . . . . . . data.

A. semi-structured
B. structured
C. unstructured
D. none of the mentioned
Answer: _________
Question 117:

On the write side, it is expected that the user pass in valid . . . . . . . . with data correctly.

A. HRecords
B. HCatRecos
C. HCatRecords
D. None of the mentioned
Answer: _________
Question 118:

Hama consists mainly of . . . . . . . . components for large-scale processing of graphs.

A. two
B. three
C. four
D. five
Answer: _________
Question 119:

. . . . . . . . is the default mode if you download Hama.

A. Local Mode
B. Pseudo Distributed Mode
C. Distributed Mode
D. All of the mentioned
Answer: _________
Question 120:

Flume deploys as one or more agents, each contained within its own instance of . . . . . . . .

A. JVM
B. Channels
C. Chunks
D. None of the mentioned
Answer: _________
Question 121:

Which of the following is a more compact binary format?

A. TCompactProtocol
B. TDenseProtocol
C. TBinaryProtocol
D. TSimpleJSONProtocol
Answer: _________
Question 122:

The Crunch APIs are modeled after . . . . . . . . which is the library that Google uses for building data pipelines on top of their own implementation of MapReduce.

A. FlagJava
B. FlumeJava
C. FlakeJava
D. All of the mentioned
Answer: _________
Question 123:

Which of the following formats is similar to TCompactProtocol?

A. TCompactProtocol
B. TDenseProtocol
C. TBinaryProtocol
D. TSimpleJSONProtocol
Answer: _________
Question 124:

A . . . . . . . . in a social graph is a group of people who interact frequently with each other and less frequently with others.

A. semi-cluster
B. partial cluster
C. full cluster
D. none of the mentioned
Answer: _________
Question 125:

. . . . . . . . can be used to generate stats over the results of arbitrary numeric functions.

A. stats.field
B. sta.field
C. stats.value
D. none of the mentioned
Answer: _________
Question 126:

PCollection, PTable, and PGroupedTable all support a . . . . . . . . operation.

A. intersection
B. union
C. OR
D. None of the mentioned
Answer: _________
Question 127:

. . . . . . . . was created to allow you to flow data from a source into your Hadoop environment.

A. Impala
B. Oozie
C. Flume
D. All of the mentioned
Answer: _________
Question 128:

Spark includes a collection of over . . . . . . . . operators for transforming data and familiar data frame APIs for manipulating semi-structured data.

A. 50
B. 60
C. 70
D. 80
Answer: _________
Question 129:

Hama is a general . . . . . . . . computing engine on top of Hadoop.

A. BSP
B. ASP
C. MPP
D. None of the mentioned
Answer: _________
Question 130:

. . . . . . . . method is used to include a projection schema, to specify the output fields.

A. OutputSchema
B. setOut
C. setOutputSchema
D. none of the mentioned
Answer: _________
Question 131:

Drill provides a . . . . . . . . like internal data model to represent and process data.

A. XML
B. JSON
C. TIFF
D. None of the mentioned
Answer: _________
Question 132:

The Avros class also has a . . . . . . . . method for creating PTypes for POJOs using Avro's reflection-based serialization mechanism.

A. spot
B. reflects
C. gets
D. all of the mentioned
Answer: _________
Question 133:

A key of type . . . . . . . . is generated which is used later to join ngrams with their heads and tails in the reducer phase.

A. GramKey
B. Primary
C. Secondary
D. None of the mentioned
Answer: _________
Question 134:

During merging, . . . . . . . . now always checks the incoming segments for corruption before merging.

A. LocalWriter
B. IndexWriter
C. ReadWriter
D. All of the mentioned
Answer: _________
Question 135:

. . . . . . . . is a distributed graph processing framework on top of Spark.

A. MLlib
B. Spark Streaming
C. GraphX
D. All of the mentioned
Answer: _________
Question 136:

Apache . . . . . . . . provides direct queries on self-describing and semi-structured data in files.

A. Drill
B. Mahout
C. Oozie
D. All of the mentioned
Answer: _________
Question 137:

. . . . . . . . mode is used when you just have a single server and want to launch all the daemon processes.

A. Local Mode
B. Pseudo Distributed Mode
C. Distributed Mode
D. All of the mentioned
Answer: _________
Question 138:

. . . . . . . . is a distributed machine learning framework on top of Spark.

A. MLlib
B. Spark Streaming
C. GraphX
D. RDDs
Answer: _________
Question 139:

Hive does not have a data type corresponding to the . . . . . . . . type in Pig.

A. decimal
B. short
C. biginteger
D. datetime
Answer: _________
Question 140:

You can write to a single partition by specifying the partition key(s) and value(s) in the . . . . . . . . method.

A. setOutput
B. setOut
C. put
D. get
Answer: _________
Question 141:

DoFns provide direct access to the . . . . . . . . object that is used within a given Map or Reduce task via the getContext method.

A. TaskInputContext
B. TaskInputOutputContext
C. TaskOutputContext
D. All of the mentioned
Answer: _________
Question 142:

GraphX provides an API for expressing graph computation that can model the . . . . . . . . abstraction.

A. GaAdt
B. Spark Core
C. Pregel
D. None of the mentioned
Answer: _________
Question 143:

The . . . . . . . . class allows developers to exercise precise control over how data is partitioned, sorted, and grouped by the underlying execution engine.

A. Grouping
B. GroupingOptions
C. RowGrouping
D. None of the mentioned
Answer: _________
Question 144:

Hive, Pig, and Cascading all use a . . . . . . . . data model.

A. value centric
B. columnar
C. tuple-centric
D. none of the mentioned
Answer: _________
Question 145:

Which of the following languages is not supported by Spark?

A. Java
B. Pascal
C. Scala
D. Python
Answer: _________
Question 146:

With HCatalog, . . . . . . . . does not need to modify the table structure.

A. Partition
B. Columns
C. Robert
D. All of the mentioned
Answer: _________
Question 147:

. . . . . . . . is a single-threaded server using standard blocking I/O.

A. TNonblockingServer
B. TSimpleServer
C. TSocket
D. None of the mentioned
Answer: _________
Question 148:

The HCatalog . . . . . . . . supports all Hive DDL that does not require MapReduce to execute.

A. Powershell
B. CLI
C. CMD
D. All of the mentioned
Answer: _________
Question 149:

Spark SQL provides a domain-specific language to manipulate . . . . . . . . in Scala, Java, or Python.

A. Spark Streaming
B. Spark SQL
C. RDDs
D. All of the mentioned
Answer: _________
Question 150:

Drill integrates with BI tools using a standard . . . . . . . . connector.

A. JDBC
B. ODBC
C. ODBC-JDBC
D. All of the mentioned
Answer: _________
Question 151:

The . . . . . . . . collocation identifier is integrated into the process that is used to create vectors from sequence files of text keys and values.

A. lbr
B. lcr
C. llr
D. lar
Answer: _________
Question 152:

HCatalog supports the same data types as . . . . . . . .

A. Pig
B. Hama
C. Hive
D. Oozie
Answer: _________
Question 153:

. . . . . . . . is the type supported for storing values in HCatalog tables.

A. HCatRecord
B. HCatColumns
C. HCatValues
D. All of the mentioned
Answer: _________
Question 154:

PostingsFormat now uses a . . . . . . . . API when writing postings, just like doc values.

A. push
B. pull
C. read
D. all of the mentioned
Answer: _________
Question 155:

Spark powers a stack of high-level tools including Spark SQL, MLlib for . . . . . . . .

A. regression models
B. statistics
C. machine learning
D. reproductive research
Answer: _________
Question 156:

. . . . . . . . is a write-only protocol that cannot be parsed by Thrift.

A. TCompactProtocol
B. TDenseProtocol
C. TBinaryProtocol
D. TSimpleJSONProtocol
Answer: _________
Question 157:

HCatalog is built on top of the Hive metastore and incorporates Hive's . . . . . . . .

A. DDL
B. DML
C. TCL
D. DCL
Answer: _________
Question 158:

. . . . . . . . is used as a remote procedure call (RPC) framework at Facebook.

A. Oozie
B. Mahout
C. Thrift
D. Impala
Answer: _________
Question 159:

The . . . . . . . . property allows users to override the expiry time specified.

A. hcat.desired.partition.num.splits
B. hcatalog.hive.client.cache.expiry.time
C. hcatalog.hive.client.cache.disabled
D. hcat.append.limit
Answer: _________
Question 160:

. . . . . . . . is where you would land a flow (or possibly multiple flows joined together) into an HDFS-formatted file system.

A. Collector Tier Event
B. Agent Tier Event
C. Basic
D. All of the mentioned
Answer: _________

Answer Key

1: D, E, I, M, T, X, \, ], d, h, i, p, s, v, |, �
2: B
3: A
4: B
5: B
6: B
7: B
8: B
9: B
10: C
11: B
12: B
13: D, H, I, P, S, X, [, `, b, e, j, p, t, w, |, ~
14: A
15: B
16: C
17: B
18: C
19: A
20: B
21: B
22: B
23: B
24: B
25: B
26: A
27: A
28: A
29: B
30: B
31: B
32: A
33: A
34: D
35: B
36: C
37: D
38: A
39: D
40: C
41: B
42: C
43: D
44: A
45: D
46: A
47: C
48: A
49: B
50: B
51: B
52: A
53: B
54: B
55: A
56: A
57: A
58: B
59: C
60: A
61: B
62: A
63: B
64: B
65: B
66: B
67: B
68: A
69: B
70: B
71: A
72: A
73: B
74: C
75: A
76: C
77: A
78: C
79: B
80: C
81: D
82: A
83: D
84: C
85: B
86: C
87: D
88: A
89: D
90: C
91: B
92: C
93: D
94: A
95: D
96: C
97: B
98: A
99: A
100: A
101: B
102: C
103: C
104: A
105: D
106: A
107: B
108: B
109: C
110: A
111: B
112: A
113: B
114: B
115: B
116: A
117: C
118: B
119: A
120: A
121: A
122: B
123: B
124: A
125: A
126: B
127: C
128: D
129: A
130: C
131: B
132: B
133: A
134: B
135: C
136: A
137: B
138: A
139: C
140: A
141: B
142: C
143: B
144: C
145: B
Solution: Apache Spark is a powerful open-source processing engine for big data analytics and supports multiple programming languages.
Option A: Java - This is supported by Spark. Spark has a comprehensive API for Java.
Option B: Pascal - This is not supported by Spark. Pascal is not one of the languages for which Spark provides APIs.
Option C: Scala - This is supported by Spark. Spark is written in Scala and provides a robust API for it.
Option D: Python - This is supported by Spark. Spark has a popular API for Python known as PySpark.
Therefore, the correct answer is Option B: Pascal.
146: C
147: B
148: B
149: C
150: B
151: C
152: C
153: A
154: B
155: C
156: D
157: A
158: C
159: B
160: A